LNCS 3021 



Tomas Pajdla 
Jiri Matas (Eds.) 



Computer Vision - 
ECCV 2004 



8th European Conference on Computer Vision 
Prague, Czech Republic, May 2004 
Proceedings, Part I 






3021 



Lecture Notes in Computer Science 

Commenced Publication in 1973 
Founding and Former Series Editors: 

Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen 

Editorial Board 

Takeo Kanade 

Carnegie Mellon University, Pittsburgh, PA, USA 
Josef Kittler 

University of Surrey, Guildford, UK 
Jon M. Kleinberg 

Cornell University, Ithaca, NY, USA 
Friedemann Mattern 

ETH Zurich, Switzerland 
John C. Mitchell 

Stanford University, CA, USA 
Oscar Nierstrasz 

University of Bern, Switzerland 
C. Pandu Rangan 

Indian Institute of Technology, Madras, India 
Bernhard Steffen 

University of Dortmund, Germany 
Madhu Sudan 

Massachusetts Institute of Technology, MA, USA 
Demetri Terzopoulos 

New York University, NY, USA 
Doug Tygar 

University of California, Berkeley, CA, USA 
Moshe Y. Vardi 

Rice University, Houston, TX, USA 
Gerhard Weikum 

Max-Planck Institute of Computer Science, Saarbruecken, Germany 



Springer 

Berlin 
Heidelberg 
New York 
Hong Kong 
London 
Milan 
Paris 
Tokyo 



Tomas Pajdla Jiri Matas (Eds.) 



Computer Vision - 
ECCV 2004 



8th European Conference on Computer Vision 
Prague, Czech Republic, May 11-14, 2004 
Proceedings, Part I 




Springer 



Volume Editors 



Tomas Pajdla 
Jiri Matas 

Czech Technical University in Prague, Department of Cybernetics 
Center for Machine Perception 
121-35 Prague 2, Czech Republic 
E-mail: {pajdla, matas}@cmp.felk.cvut.cz 



Library of Congress Control Number: 2004104846 



CR Subject Classification (1998): I.4, I.3.5, I.5, I.2.9-10 
ISSN 0302-9743 

ISBN 3-540-21984-6 Springer-Verlag Berlin Heidelberg New York 



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are 
liable to prosecution under the German Copyright Law. 

Springer-Verlag is a part of Springer Science+Business Media 

springeronline.com 

(c) Springer-Verlag Berlin Heidelberg 2004 
Printed in Germany 

Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Protago-TeX-Production GmbH 
Printed on acid-free paper SPIN: 11007678 06/3142 5 4 3 2 1 0 



Preface 



Welcome to the proceedings of the 8th European Conference on Computer Vision! 

Following a very successful ECCV 2002, the response to our call for papers 
was almost equally strong - 555 papers were submitted. We accepted 41 papers 
for oral and 149 papers for poster presentation. 

Several innovations were introduced into the review process. First, the number 
of program committee members was increased to reduce their review load. 
We managed to assign to program committee members no more than 12 papers. 
Second, we adopted a paper ranking system. Program committee members were 
asked to rank all the papers assigned to them, even those that were reviewed 
by additional reviewers. Third, we allowed authors to respond to the reviews 
consolidated in a discussion involving the area chair and the reviewers. Fourth, 
the reports, the reviews, and the responses were made available to the authors as 
well as to the program committee members. Our aim was to provide the authors 
with maximal feedback and to let the program committee members know how 
authors reacted to their reviews and how their reviews were or were not reflected 
in the final decision. Finally, we reduced the length of reviewed papers from 15 
to 12 pages. 

The preparation of ECCV 2004 went smoothly thanks to the efforts of the organizing 
committee, the area chairs, the program committee, and the reviewers. 
We are indebted to Anders Heyden, Mads Nielsen, and Henrik J. Nielsen for 
passing on ECCV traditions and to Dominique Asselineau from ENST/TSI who 
kindly provided his GestRFIA conference software. We thank Jan-Olof Eklundh 
and Andrew Zisserman for encouraging us to organize ECCV 2004 in Prague. 
Andrew Zisserman also contributed many useful ideas concerning the organization 
of the review process. Olivier Faugeras represented the ECCV Board and 
helped us with the selection of conference topics. Kyros Kutulakos provided helpful 
information about the CVPR 2003 organization. David Vernon helped to 
secure EC Vision support. 

This conference would never have happened without the support of the 
Centre for Machine Perception of the Czech Technical University in Prague. 
We would like to thank Radim Sara for his help with the review process and 
the proceedings organization. We thank Daniel Vecerka and Martin Matousek 
who made numerous improvements to the conference software. Petr Pohl helped 
to put the proceedings together. Martina Budosova helped with administrative 
tasks. Hynek Bakstein, Ondrej Chum, Jana Kostkova, Branislav Micusik, Stepan 
Obdrzalek, Jan Sochman, and Vit Zyka helped with the organization. 
Abstract. We present an analytic solution to the problem of estimating multiple 
2-D and 3-D motion models from two-view correspondences or optical flow. The 
key to our approach is to view the estimation of multiple motion models as the 
estimation of a single multibody motion model. This is possible thanks to two 
important algebraic facts. First, we show that all the image measurements, regardless 
of their associated motion model, can be fit with a real or complex polynomial. 
Second, we show that the parameters of the motion model associated with an 
image measurement can be obtained from the derivatives of the polynomial at the 
measurement. This leads to a novel motion segmentation algorithm that applies 
to most of the two-view motion models adopted in computer vision. Our experiments 
show that the proposed algorithm outperforms existing algebraic methods in 
terms of efficiency and robustness, and provides a good initialization for iterative 
techniques, such as EM, which is strongly dependent on correct initialization. 



1 Introduction 

A classic problem in visual motion analysis is to estimate a motion model for a set of 
2-D feature points as they move in a video sequence. Ideally, one would like to fit a 
single model that describes the motion of all the features. In practice, however, different 
regions of the image obey different motion models due to depth discontinuities, perspective 
effects, multiple moving objects, etc. Therefore, one is faced with the problem of 
fitting multiple motion models to the image, without knowing which pixels are moving 
according to the same model. More specifically: 

Problem 1 (Multiple-motion estimation and segmentation). Given a set of image 
measurements $\{(x_1^j, x_2^j)\}_{j=1}^N$ taken from two views of a motion sequence related by a 
collection of $n$ ($n$ known) 2-D or 3-D motion models $\{\mathcal{M}_i\}_{i=1}^n$, estimate the motion models 
without knowing which image measurements correspond to which motion model. 

Related literature. There is a rich literature addressing the 2-D motion segmentation 
problem using the so-called layered representation [1] or different variations of the 
Expectation Maximization (EM) algorithm [2,3,4]. These approaches alternate between 
the segmentation of the image measurements (E-step) and the estimation of the motion 

* The authors thank Jacopo Piazzi and Frederik Schaffalitzky for fruitful discussions. Research 
funded with startup funds from the departments of BME at Johns Hopkins and ECE at UIUC. 
parameters (M-step) and suffer from the disadvantage that the convergence to the optimal 
solution strongly depends on correct initialization [5,6] . Existing initialization techniques 
estimate the motion parameters from local patches and cluster these motion parameters 
using K-means [7], normalized cuts [5], or a Bayesian version of RANSAC [6]. The only 
existing algebraic solution to 2-D motion segmentation is based on bi-homogeneous 
polynomial factorization and can be found in [9]. 

The 3-D motion segmentation problem has received relatively less attention. Existing 
approaches include combinations of EM with normalized cuts [8] and factorization 
methods for orthographic and affine cameras [10,11]. Algebraic approaches based on 
polynomial and tensor factorization have been proposed in the case of multiple translating 
objects [12] and in the case of two [13] and multiple [14] rigid-body motions. 

Our contribution. In this paper, we address the initialization of iterative approaches 
to motion estimation and segmentation by proposing a non-iterative algebraic solution 
to Problem 1 that applies to most 2-D and 3-D motion models in computer vision, as 
detailed in Table 1. The key to our approach is to view the estimation of multiple motion 
models as the estimation of a single, though more complex, multibody motion model that 
is then factored into the original models. This is achieved by (1) eliminating the feature 
segmentation problem in an algebraic fashion, (2) fitting a single multibody motion 
model to all the image measurements, and (3) segmenting the multibody motion model 
into its individual components. More specifically, our approach proceeds as follows: 

1. Eliminate Feature Segmentation: Find an algebraic equation that is satisfied by 
all the image measurements, regardless of the motion model associated with each 
measurement. For the motion models considered in this paper, the $i$th motion model 
will be typically defined by an algebraic equation of the form $f(x_1, x_2, \mathcal{M}_i) = 0$. 
Therefore an algebraic equation that is satisfied by all the data is 

$$g(x_1, x_2, \mathcal{M}) = f(x_1, x_2, \mathcal{M}_1)\, f(x_1, x_2, \mathcal{M}_2) \cdots f(x_1, x_2, \mathcal{M}_n) = 0. \tag{1}$$

Such an equation represents a single multibody motion model whose parameters $\mathcal{M}$ 
encode those of the original motion models $\{\mathcal{M}_i\}_{i=1}^n$. 

2. Multibody Motion Estimation: Estimate the parameters $\mathcal{M}$ of the multibody motion 
model from the given image measurements. For the motion models considered in 
this paper, the parameters $\mathcal{M}$ will correspond to the coefficients of a real or complex 
polynomial $p_n$ of degree $n$. We will show that if $n$ is known such parameters can be 
estimated linearly after embedding the image data into a higher-dimensional space. 

3. Motion Segmentation: Recover the parameters of the original motion models from 
the parameters of the multibody motion model $\mathcal{M}$, i.e. 

$$\mathcal{M} \to \{\mathcal{M}_i\}_{i=1}^n. \tag{2}$$

We will show that the individual motion parameters $\mathcal{M}_i$ can be computed from the 
derivatives of $p_n$ evaluated at a collection of $n$ image measurements. 

This new approach offers two important technical advantages over previously known 
algebraic solutions to the segmentation of 3-D translational [12] and rigid-body motions 
(fundamental matrices) [14] based on homogeneous polynomial factorization: 



A Unified Algebraic Approach to 2-D and 3-D Motion Segmentation 






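Table 1. 2-D and 3-D motion models considered in this paper 

Motion models       Model equations                                   Model parameters                                                          Equivalent to clustering 
2-D translational   $x_2 = x_1 + T_i$                                 $\{T_i \in \mathbb{R}^2\}_{i=1}^n$                                        Hyperplanes in $\mathbb{C}^2$ 
2-D similarity      $x_2 = \lambda_i R_i x_1 + T_i$                   $\{(R_i, T_i) \in SE(2), \lambda_i \in \mathbb{R}^+\}_{i=1}^n$            Hyperplanes in $\mathbb{C}^3$ 
2-D affine          $x_2 = A_i [x_1^T \; 1]^T$                        $\{A_i \in \mathbb{R}^{2 \times 3}\}_{i=1}^n$                             Hyperplanes in $\mathbb{C}^4$ 
3-D translational   $0 = x_2^T [T_i]_\times x_1$                      $\{T_i \in \mathbb{R}^3\}_{i=1}^n$                                        Hyperplanes in $\mathbb{R}^3$ 
3-D rigid-body      $0 = x_2^T F_i x_1$                               $\{F_i \in \mathbb{R}^{3 \times 3} : \mathrm{rank}(F_i) = 2\}_{i=1}^n$    Bilinear forms in $\mathbb{R}^{3 \times 3}$ 
3-D homography      $x_2 \sim H_i x_1$                                $\{H_i \in \mathbb{R}^{3 \times 3}\}_{i=1}^n$                             Bilinear forms in $\mathbb{C}^{2 \times 3}$ 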

1. It is based on polynomial differentiation rather than polynomial factorization, which 
greatly improves the efficiency, accuracy and robustness of the algorithm. 

2. It applies to either feature correspondences or optical flows and includes most of 
the two-view motion models in computer vision: 2-D translational, similarity, and 
affine, or 3-D translational, rigid body motions (fundamental matrices), or motions 
of planar scenes (homographies), as shown in Table 1. The unification is achieved 
by embedding some of the motion models into the complex domain, which resolves 
cases such as 2-D affine motions and 3-D homographies that could not be solved in 
the real domain. 

With respect to extant probabilistic methods, our approach has the advantage that it 
provides a global, non-iterative solution that does not need initialization. Therefore, our 
method can be used to initialize any iterative or optimization based technique, such as 
EM, or else in a layered (multiscale) or hierarchical fashion at the user’s discretion. 



Noisy image data. Although the derivation of the algorithm will assume noise free data, 
the algorithm is designed to work with moderate noise, as we will soon point out. 

Notation. Let $z$ be a vector in $\mathbb{R}^K$ or $\mathbb{C}^K$ and let $z^T$ be its transpose. A homogeneous 
polynomial of degree $n$ in $z$ is a polynomial $p_n(z)$ such that $p_n(\lambda z) = \lambda^n p_n(z)$ for 
all $\lambda$ in $\mathbb{R}$ or $\mathbb{C}$. The space of all homogeneous polynomials of degree $n$ in $K$ variables, 
$\mathcal{R}_n(K)$, is a vector space of dimension $M_n(K) = \binom{n+K-1}{K-1} = \binom{n+K-1}{n}$. A 
particular basis for $\mathcal{R}_n(K)$ is obtained by considering all the monomials of degree $n$ 
in $K$ variables, that is $z^I = z_1^{n_1} z_2^{n_2} \cdots z_K^{n_K}$ with $0 \le n_j \le n$ for $j = 1, \ldots, K$, and 
$n_1 + n_2 + \cdots + n_K = n$. Therefore, each polynomial $p_n(z) \in \mathcal{R}_n(K)$ can be written 
as a linear combination of a vector of coefficients $c \in \mathbb{R}^{M_n(K)}$ or $\mathbb{C}^{M_n(K)}$ as 

$$p_n(z) = c^T \nu_n(z) = \sum c_{n_1, n_2, \ldots, n_K}\, z_1^{n_1} z_2^{n_2} \cdots z_K^{n_K}, \tag{3}$$

where $\nu_n : \mathbb{R}^K (\mathbb{C}^K) \to \mathbb{R}^{M_n(K)} (\mathbb{C}^{M_n(K)})$ is the Veronese map of degree $n$ [12] defined 
as $\nu_n : [z_1, \ldots, z_K]^T \mapsto [\ldots, z^I, \ldots]^T$ with $I$ chosen in the degree-lexicographic order. 
The Veronese map is also known as the polynomial embedding in the machine learning 
community. 
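
The Veronese map and the dimension $M_n(K)$ can be illustrated with a minimal Python sketch. The function names `num_monomials` and `veronese` are hypothetical and introduced only for illustration; the sketch assumes the degree-lexicographic ordering described above, so the last entry is always $z_K^n$.

```python
from itertools import combinations_with_replacement
from math import comb, prod


def num_monomials(n: int, K: int) -> int:
    """M_n(K): number of monomials of degree n in K variables."""
    return comb(n + K - 1, K - 1)


def veronese(z, n: int):
    """Degree-n Veronese embedding of z (hypothetical helper).

    Each sorted index tuple (i1 <= ... <= in) names the monomial
    z[i1] * ... * z[in]; combinations_with_replacement yields these
    tuples in lexicographic order, so the first entry is z_1^n and
    the last entry is z_K^n.
    """
    K = len(z)
    return [prod(z[i] for i in idx)
            for idx in combinations_with_replacement(range(K), n)]


if __name__ == "__main__":
    print(num_monomials(2, 3))   # number of quadratic monomials in 3 variables
    print(veronese([2, 3], 2))   # [z1^2, z1*z2, z2^2] evaluated at z = (2, 3)
```

The same function works for complex inputs, since it only multiplies the entries of `z`.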
2 2-D Motion Segmentation by Clustering Hyperplanes in $\mathbb{C}^K$ 



2.1 Segmentation of 2-D Translational Motions: Clustering Hyperplanes in $\mathbb{C}^2$ 

The case of feature points. Under the 2-D translational motion model the two images 
are related by one out of $n$ possible 2-D translations $\{T_i \in \mathbb{R}^2\}_{i=1}^n$. That is, for each 
feature pair $x_1 \in \mathbb{R}^2$ and $x_2 \in \mathbb{R}^2$ there exists a 2-D translation $T_i \in \mathbb{R}^2$ such that 

$$x_2 = x_1 + T_i. \tag{4}$$

Therefore, if we interpret the displacement of the features $(x_2 - x_1)$ and the 2-D 
translations $T_i$ as complex numbers $(x_2 - x_1) \in \mathbb{C}$ and $T_i \in \mathbb{C}$, then we can rewrite 
equation (4) as 

$$b_i^T z = \begin{bmatrix} T_i & 1 \end{bmatrix} \begin{bmatrix} 1 \\ -(x_2 - x_1) \end{bmatrix} = 0, \qquad z \in \mathbb{C}^2. \tag{5}$$



The above equation corresponds to a hyperplane in $\mathbb{C}^2$ whose normal vector $b_i$ encodes 
the 2-D translational motion $T_i$. Therefore, the segmentation of $n$ 2-D translational 
motions $\{T_i \in \mathbb{R}^2\}_{i=1}^n$ from a set of correspondences $\{x_1^j \in \mathbb{R}^2\}_{j=1}^N$ and 
$\{x_2^j \in \mathbb{R}^2\}_{j=1}^N$ is equivalent to clustering data points $\{z^j \in \mathbb{C}^2\}_{j=1}^N$ lying on $n$ complex 
hyperplanes with normal vectors $\{b_i \in \mathbb{C}^2\}_{i=1}^n$. As we will see shortly, other 2-D 
and 3-D motion segmentation problems are also equivalent to clustering data lying 
on complex hyperplanes in $\mathbb{C}^3$ and $\mathbb{C}^4$. Therefore, rather than solving the hyperplane 
clustering problem for the case $K = 2$, we now present a solution for hyperplanes in 
$\mathbb{C}^K$ with arbitrary $K$ by adapting the Generalized PCA algorithm of [15] to the complex 
domain. 
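
The complex embedding of a translational correspondence can be sketched in a few lines of Python. This is only an illustration under the assumption that the data point is $z = [1, -(x_2 - x_1)]^T$ and the normal is $b_i = [T_i, 1]^T$; the function names are hypothetical.

```python
def embed_translation(x1, x2):
    """Map a correspondence (x1, x2), each an (x, y) pair, to z in C^2."""
    d = complex(x2[0] - x1[0], x2[1] - x1[1])  # displacement as a complex number
    return [1 + 0j, -d]


def normal_from_translation(T):
    """Normal vector b = [T, 1] encoding the 2-D translation T = (tx, ty)."""
    return [complex(T[0], T[1]), 1 + 0j]


def bilinear(b, z):
    """Plain product b^T z (no complex conjugation)."""
    return sum(bk * zk for bk, zk in zip(b, z))


if __name__ == "__main__":
    x1, T = (1.0, 2.0), (3.0, -1.0)
    x2 = (x1[0] + T[0], x1[1] + T[1])
    z = embed_translation(x1, x2)
    print(abs(bilinear(normal_from_translation(T), z)))            # zero: z lies on the hyperplane of T
    print(abs(bilinear(normal_from_translation((0.0, 2.0)), z)))   # nonzero for a different translation
```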



Eliminating feature segmentation. We first notice that each point $z \in \mathbb{C}^K$, regardless 
of which motion model $\{b_i \in \mathbb{C}^K\}_{i=1}^n$ is associated with it, must satisfy the following 
homogeneous polynomial of degree $n$ in $K$ complex variables 

$$p_n(z) = \prod_{i=1}^n (b_i^T z) = \sum_I c_I z^I = \sum c_{n_1, \ldots, n_K}\, z_1^{n_1} z_2^{n_2} \cdots z_K^{n_K} = c^T \nu_n(z) = 0, \tag{6}$$

where the coefficient vector $c \in \mathbb{C}^{M_n(K)}$ represents the multibody motion parameters. 
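
As a small worked instance of the polynomial above, consider $n = 2$ translational motions ($K = 2$) with normals $b_1 = [T_1, 1]^T$ and $b_2 = [T_2, 1]^T$. Expanding the product by hand gives the three coefficients of the multibody model; this expansion is an illustration, not part of the original derivation.

```latex
\begin{aligned}
p_2(z) &= (T_1 z_1 + z_2)(T_2 z_1 + z_2) \\
       &= T_1 T_2\, z_1^2 + (T_1 + T_2)\, z_1 z_2 + z_2^2 , \\
\nu_2(z) &= [\, z_1^2,\; z_1 z_2,\; z_2^2 \,]^T , \qquad
c = [\, T_1 T_2,\; T_1 + T_2,\; 1 \,]^T .
\end{aligned}
```

The last entry of $c$ equals one because the last entry of every $b_i$ equals one.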



Estimating multibody motion. Since the polynomial $p_n$ must be satisfied by all the 
data points $Z = \{z^j \in \mathbb{C}^K\}_{j=1}^N$, we obtain the following linear system on $c$ 

$$L_n c = 0 \in \mathbb{C}^N, \tag{7}$$

where $L_n = [\nu_n(z^1), \nu_n(z^2), \ldots, \nu_n(z^N)]^T \in \mathbb{C}^{N \times M_n(K)}$. One can show that there is 
a unique solution for $c$ (up to a scale factor) if $N \ge M_n(K) - 1$ and at least $K - 1$ points 
belong to each hyperplane. Furthermore, since the last entry of each $b_i$ is equal to one, 
then so is the last entry of $c$. Therefore, one can solve for $c$ uniquely. In the presence of 
noise, one can solve for $c$ in a least-squares sense as the singular vector of $L_n$ associated 
with its smallest singular value, and then normalize so that $c_{M_n(K)} = 1$. 
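
The linear estimation of $c$ can be sketched for the noise-free case in plain Python. This is a simplified illustration, not the least-squares (singular vector) solution described above: it assumes exactly $M_n(K) - 1$ noise-free points in general position, fixes the last entry of $c$ to one, and solves the resulting square system by Gaussian elimination. All function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod


def veronese(z, n):
    """Degree-n Veronese embedding, monomials in degree-lexicographic order."""
    return [prod(z[i] for i in idx)
            for idx in combinations_with_replacement(range(len(z)), n)]


def solve_linear(A, rhs):
    """Gaussian elimination with partial pivoting for a square complex system."""
    m = len(A)
    M = [list(row) + [r] for row, r in zip(A, rhs)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= f * M[col][k]
    x = [0j] * m
    for r in reversed(range(m)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, m))
        x[r] = (M[r][m] - s) / M[r][r]
    return x


def estimate_c(points, n):
    """Solve L_n c = 0 with the last entry of c fixed to 1.

    Assumption: len(points) == M_n(K) - 1 and the data are noise free.
    """
    rows = [veronese(z, n) for z in points]
    A = [row[:-1] for row in rows]        # unknown coefficients
    rhs = [-row[-1] for row in rows]      # move the known last column to the right
    return solve_linear(A, rhs) + [1 + 0j]


if __name__ == "__main__":
    T1, T2 = 1 + 2j, -3 + 0.5j
    c = estimate_c([[1, -T1], [1, -T2]], n=2)
    print(c)  # close to [T1*T2, T1+T2, 1]
```

With noisy data one would instead stack all $N$ rows and take the singular vector associated with the smallest singular value, as the text states.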
Segmenting the multibody motion. Given $c$, we now present an algorithm for computing 
the motion parameters $b_i$ from the derivatives of $p_n$. To this end, we consider the 
derivative of $p_n(z)$, 

$$Dp_n(z) = \frac{\partial p_n(z)}{\partial z} = \sum_{i=1}^n \prod_{\ell \neq i} (b_\ell^T z)\, b_i, \tag{8}$$

and notice that if we evaluate $Dp_n(z)$ at a point $z = y_i$ that corresponds to the $i$th 
motion model, i.e. if $y_i$ is such that $b_i^T y_i = 0$, then we have $Dp_n(y_i) \sim b_i$. Therefore, 
given $c$ we can obtain the motion parameters as 

$$b_i = \left. \frac{Dp_n(z)}{e_K^T Dp_n(z)} \right|_{z = y_i}, \tag{9}$$

where $e_K = [0, \ldots, 0, 1]^T \in \mathbb{C}^K$ and $y_i \in \mathbb{C}^K$ is a nonzero vector such that $b_i^T y_i = 0$. 
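
The recovery of a normal vector from the gradient of $p_n$ can be checked numerically with a short sketch. It assumes that $c$ is ordered by the degree-lexicographic monomials and that the point $y$ lies exactly on one hyperplane; the names `gradient` and `normal_at` are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod


def gradient(c, z, n):
    """Gradient Dp_n(z) of p_n(z) = c^T v_n(z)."""
    K = len(z)
    grad = [0j] * K
    for coeff, idx in zip(c, combinations_with_replacement(range(K), n)):
        for k in set(idx):
            rest = list(idx)
            rest.remove(k)  # differentiate: drop one factor z_k
            grad[k] += coeff * idx.count(k) * prod(z[i] for i in rest)
    return grad


def normal_at(c, y, n):
    """b_i = Dp_n(y) / (e_K^T Dp_n(y)) for a point y on the i-th hyperplane."""
    g = gradient(c, y, n)
    return [gk / g[-1] for gk in g]


if __name__ == "__main__":
    T1, T2 = 1 + 2j, -3 + 0.5j
    c = [T1 * T2, T1 + T2, 1 + 0j]      # coefficients of (T1 z1 + z2)(T2 z1 + z2)
    print(normal_at(c, [1, -T1], 2))    # close to [T1, 1]
    print(normal_at(c, [1, -T2], 2))    # close to [T2, 1]
```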

The rest of the problem is to find one vector $y_i \in \mathbb{C}^K$ in each one of the hyperplanes 
$H_i = \{z \in \mathbb{C}^K : b_i^T z = 0\}$ for $i = 1, \ldots, n$. To this end, notice that we can always 
choose a point $y_n$ lying on one of the hyperplanes as any of the points in the data set 
$Z$. However, in the presence of noise and outliers, an arbitrary point in $Z$ may be far 
from the hyperplanes. The question is then how to compute the distance from each data 
point to its closest hyperplane, without knowing the normals to the hyperplanes. The 
following lemma allows us to compute a first order approximation to such a distance: 

Lemma 1. Let $\tilde{z} \in H_i$ be the projection of a point $z \in \mathbb{C}^K$ onto its closest hyperplane 
$H_i$. Also let $\Pi = (I - e_K e_K^T)$. Then the Euclidean distance from $z$ to $H_i$ is given by 

$$\|z - \tilde{z}\| = \frac{|p_n(z)|}{\|\Pi Dp_n(z)\|} + o(\|z - \tilde{z}\|^2). \tag{10}$$



Therefore, we can choose a point in the data set close to one of the subspaces as: 

$$y_n = \arg\min_{z \in Z} \frac{|p_n(z)|}{\|\Pi Dp_n(z)\|}, \tag{11}$$

and then compute the normal vector at $y_n$ as $b_n = Dp_n(y_n) / (e_K^T Dp_n(y_n))$. In order to 
find a point $y_{n-1}$ in one of the remaining hyperplanes, we could just remove the points 
on $H_n$ from $Z$ and compute $y_{n-1}$ similarly to (11), but minimizing over $Z \setminus H_n$, and 
so on. However, the above process is not very robust in the presence of noise. Therefore, 
we propose an alternative solution that penalizes choosing a point from $H_n$ in (11) by 
dividing the objective function by the distance from $z$ to $H_n$, namely $|b_n^T z| / \|\Pi b_n\|$. 
That is, we can choose a point on or close to $\bigcup_{i=1}^{n-1} H_i$ as 

$$y_{n-1} = \arg\min_{z \in Z} \frac{\dfrac{|p_n(z)|}{\|\Pi Dp_n(z)\|} + \delta}{\dfrac{|b_n^T z|}{\|\Pi b_n\|} + \delta}, \tag{12}$$

where $\delta > 0$ is a small positive number chosen to avoid cases in which both the numerator 
and the denominator are zero (e.g. with perfect data). By repeating this process for the 
remaining hyperplanes, we obtain the following hyperplane clustering algorithm: 
Algorithm 1 (Clustering hyperplanes in $\mathbb{C}^K$) Given data points $Z = \{z^j \in \mathbb{C}^K\}_{j=1}^N$, 

solve for $c \in \mathbb{C}^{M_n(K)}$ from the linear system $[\nu_n(z^1), \nu_n(z^2), \ldots, \nu_n(z^N)]^T c = 0$; 
set $p_n(z) = c^T \nu_n(z)$; 
for $i = n : 1$, 

$$y_i = \arg\min_{z \in Z} \frac{\dfrac{|p_n(z)|}{\|\Pi Dp_n(z)\|} + \delta}{\dfrac{|b_{i+1}^T z| \cdots |b_n^T z|}{\|\Pi b_{i+1}\| \cdots \|\Pi b_n\|} + \delta}; \qquad b_i = \frac{Dp_n(y_i)}{e_K^T Dp_n(y_i)}; \tag{13}$$

end. 



Notice that one could also choose the points $y_i$ in a purely algebraic fashion, e.g., 
by intersecting a random line with the hyperplanes, or else by dividing the polynomial 
$p_n(z)$ by $b_n^T z$. However, we have chosen to present Algorithm 1 instead, because it has 
a better performance with noisy data and is not very sensitive to the choice of $\delta$. 
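
The selection loop of Algorithm 1 can be sketched as follows, assuming the coefficient vector $c$ has already been estimated. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the default value of `delta` is an arbitrary choice, and the sketch assumes that $\|\Pi Dp_n(z)\|$ and $\|\Pi b_i\|$ are nonzero for the given data.

```python
from itertools import combinations_with_replacement
from math import prod, sqrt


def poly(c, z, n):
    """p_n(z) = c^T v_n(z), monomials in degree-lexicographic order."""
    idxs = combinations_with_replacement(range(len(z)), n)
    return sum(ci * prod(z[i] for i in idx) for ci, idx in zip(c, idxs))


def gradient(c, z, n):
    """Gradient Dp_n(z)."""
    grad = [0j] * len(z)
    for ci, idx in zip(c, combinations_with_replacement(range(len(z)), n)):
        for k in set(idx):
            rest = list(idx)
            rest.remove(k)
            grad[k] += ci * idx.count(k) * prod(z[i] for i in rest)
    return grad


def norm_drop_last(v):
    """||Pi v|| with Pi = I - e_K e_K^T: the norm of v without its last entry."""
    return sqrt(sum(abs(vk) ** 2 for vk in v[:-1]))


def cluster_hyperplanes(Z, c, n, delta=1e-6):
    """Return n normal vectors, found in the order b_n, b_{n-1}, ..., b_1."""
    normals = []
    for _ in range(n):
        def objective(z):
            num = abs(poly(c, z, n)) / norm_drop_last(gradient(c, z, n)) + delta
            # Product of distances to the hyperplanes already found
            # (an empty product equals 1 on the first pass).
            den = prod(abs(sum(bk * zk for bk, zk in zip(b, z))) / norm_drop_last(b)
                       for b in normals) + delta
            return num / den
        y = min(Z, key=objective)
        g = gradient(c, y, n)
        normals.append([gk / g[-1] for gk in g])
    return normals


if __name__ == "__main__":
    T1, T2 = 1 + 2j, -3 + 0.5j
    c = [T1 * T2, T1 + T2, 1 + 0j]
    Z = [[1, -T1], [1, -T1], [1, -T2]]
    print(cluster_hyperplanes(Z, c, 2))  # normals close to [T1, 1] and [T2, 1]
```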



The case of translational optical flow. Imagine now that rather than a collection of 
feature points we are given the optical flow $\{u^j \in \mathbb{R}^2\}_{j=1}^N$ between two consecutive 
views of a video sequence. If we assume that the optical flow is piecewise constant, i.e. 
the optical flow of every pixel in the image takes only $n$ possible values $\{T_i \in \mathbb{R}^2\}_{i=1}^n$, 
then at each pixel $j \in \{1, \ldots, N\}$ there exists a motion $T_i$ such that 

$$u^j = T_i. \tag{14}$$

The problem is now to estimate the $n$ motion models $\{T_i\}_{i=1}^n$ from the optical flow 
$\{u^j\}_{j=1}^N$. If $N \ge M_n(2) - 1 \sim O(n)$, this problem can be solved using the same 
technique as in the case of feature points (Algorithm 1 with $K = 2$) after replacing 
$x_2 - x_1 = u$. 



2.2 Segmentation of 2-D Similarity Motions: Clustering Hyperplanes in $\mathbb{C}^3$ 

The case of feature points. In this case, we assume that for each feature point $(x_1, x_2)$ 
there exists a 2-D rigid-body motion $(R_i, T_i) \in SE(2)$ and a scale $\lambda_i \in \mathbb{R}^+$ such that 

$$x_2 = \lambda_i R_i x_1 + T_i = \lambda_i \begin{bmatrix} \cos(\theta_i) & -\sin(\theta_i) \\ \sin(\theta_i) & \cos(\theta_i) \end{bmatrix} x_1 + T_i. \tag{15}$$

Therefore, if we interpret the rotation matrix as a unit number $R_i = \exp(\theta_i \sqrt{-1}) \in \mathbb{C}$, 
and the translation vector and the image features as points in the complex plane 
$T_i, x_1, x_2 \in \mathbb{C}$, then we can write the 2-D similarity motion model as the following 
hyperplane in $\mathbb{C}^3$: 

$$b_i^T z = \begin{bmatrix} \lambda_i R_i & T_i & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ 1 \\ -x_2 \end{bmatrix} = 0. \tag{16}$$



Therefore, the segmentation of 2-D similarity motions is equivalent to clustering 
hyperplanes in $\mathbb{C}^3$. As such, we can apply Algorithm 1 with $K = 3$ to a collection of 
$N \ge M_n(3) - 1 \sim O(n^2)$ image measurements $\{z^j \in \mathbb{C}^3\}_{j=1}^N$, with at least two 
measurements per motion model, to obtain the motion parameters $\{b_i \in \mathbb{C}^3\}_{i=1}^n$. The 
original real motion parameters are then given as 

$$\lambda_i = |b_{i1}|, \quad \theta_i = \angle b_{i1}, \quad \text{and} \quad T_i = [\mathrm{Re}(b_{i2}), \mathrm{Im}(b_{i2})]^T, \quad \text{for } i = 1, \ldots, n. \tag{17}$$
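
Reading the real similarity parameters off a complex normal vector is a one-line operation per parameter. The sketch below assumes the normal has the form $b_i = [\lambda_i R_i, T_i, 1]^T$ with data point $z = [x_1, 1, -x_2]^T$; the function names are hypothetical.

```python
import cmath


def similarity_from_normal(b):
    """Recover (scale, rotation angle, translation) from b = [lambda*R, T, 1]."""
    lam = abs(b[0])                 # scale: modulus of the first entry
    theta = cmath.phase(b[0])       # rotation angle: argument of the first entry
    T = (b[1].real, b[1].imag)      # translation: real and imaginary parts
    return lam, theta, T


def embed_similarity(x1, x2):
    """Data point z = [x1, 1, -x2] with both image points read as complex numbers."""
    return [complex(*x1), 1 + 0j, -complex(*x2)]


if __name__ == "__main__":
    b = [2.0 * cmath.exp(0.3j), complex(1.0, -4.0), 1 + 0j]
    x1 = complex(0.5, 1.5)
    x2 = b[0] * x1 + b[1]           # apply the similarity in complex form
    z = embed_similarity((x1.real, x1.imag), (x2.real, x2.imag))
    print(abs(sum(bk * zk for bk, zk in zip(b, z))))  # zero up to rounding
    print(similarity_from_normal(b))                  # scale 2, angle 0.3, T = (1, -4)
```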



The case of optical flow. Let $\{u^j \in \mathbb{R}^2\}_{j=1}^N$ be $N$ measurements of the optical flow 
at the $N$ pixels $\{x^j \in \mathbb{R}^2\}_{j=1}^N$. We assume that the optical flow field can be modeled 
as a collection of $n$ 2-D similarity motion models as $u = \lambda_i R_i x + T_i$. Therefore, the 
segmentation of 2-D similarity motions from measurements of optical flow can be solved 
as in the case of feature points, after replacing $x_2 = u$ and $x_1 = x$. 



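2.3 Segmentation of 2-D Affine Motions: Clustering Hyperplanes in $\mathbb{C}^4$ 

The case of feature points. In this case, we assume that the images are related by a 
collection of $n$ 2-D affine motion models $\{A_i \in \mathbb{R}^{2 \times 3}\}_{i=1}^n$. That is, for each feature pair 
$(x_1, x_2)$ there exists a 2-D affine motion $A_i$ such that 

$$x_2 = A_i \begin{bmatrix} x_1 \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}_i \begin{bmatrix} x_1 \\ 1 \end{bmatrix}. \tag{18}$$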



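Therefore, if we interpret $x_2$ as a complex number $x_2 \in \mathbb{C}$, but we still think of $x_1$ as a 
vector in $\mathbb{R}^2$, then we have 

$$x_2 = a_i^T \begin{bmatrix} x_1 \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} + a_{21}\sqrt{-1} & a_{12} + a_{22}\sqrt{-1} & a_{13} + a_{23}\sqrt{-1} \end{bmatrix}_i \begin{bmatrix} x_1 \\ 1 \end{bmatrix}. \tag{19}$$

The above equation represents the following hyperplane in $\mathbb{C}^4$ 

$$b_i^T z = \begin{bmatrix} a_i^T & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ 1 \\ -x_2 \end{bmatrix} = 0, \tag{20}$$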



where the normal vector $b_i \in \mathbb{C}^4$ encodes the affine motion parameters and the data 
point $z \in \mathbb{C}^4$ encodes the image measurements $x_1 \in \mathbb{R}^2$ and $x_2 \in \mathbb{C}$. Therefore, the 
segmentation of 2-D affine motion models is equivalent to clustering hyperplanes in $\mathbb{C}^4$. 
As such, we can apply Algorithm 1 with $K = 4$ to a collection of $N \ge M_n(4) - 1 \sim O(n^3)$ 
image measurements $\{z^j \in \mathbb{C}^4\}_{j=1}^N$, with at least three measurements per motion 
model, to obtain the motion parameters $\{b_i \in \mathbb{C}^4\}_{i=1}^n$. The original affine motion models 
are then obtained as 

$$A_i = \begin{bmatrix} \mathrm{Re}(b_{i1}) & \mathrm{Re}(b_{i2}) & \mathrm{Re}(b_{i3}) \\ \mathrm{Im}(b_{i1}) & \mathrm{Im}(b_{i2}) & \mathrm{Im}(b_{i3}) \end{bmatrix} \in \mathbb{R}^{2 \times 3}, \quad \text{for } i = 1, \ldots, n. \tag{21}$$
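
The affine case follows the same pattern: the real and imaginary parts of the first three entries of the normal give the two rows of the affine matrix. The sketch below assumes $b_i = [a_i^T, 1]^T$ and $z = [x_1^T, 1, -x_2]^T$; the function names are hypothetical.

```python
def affine_from_normal(b):
    """A_i in R^{2x3} from a normal b = [a1, a2, a3, 1] in C^4."""
    return [[b[k].real for k in range(3)],
            [b[k].imag for k in range(3)]]


def embed_affine(x1, x2):
    """z = [x1 (two real coordinates), 1, -x2], with x2 read as a complex number."""
    return [complex(x1[0]), complex(x1[1]), 1 + 0j, -complex(x2[0], x2[1])]


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [complex(A[0][k], A[1][k]) for k in range(3)] + [1 + 0j]
    x1 = (0.5, -1.0)
    x2 = tuple(row[0] * x1[0] + row[1] * x1[1] + row[2] for row in A)
    z = embed_affine(x1, x2)
    print(abs(sum(bk * zk for bk, zk in zip(b, z))))  # zero up to rounding
    print(affine_from_normal(b) == A)
```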



The case of affine optical flow. In this case, the optical flow $u$ is modeled as being 
generated by a collection of $n$ affine motion models $\{A_i \in \mathbb{R}^{2 \times 3}\}_{i=1}^n$ of the form 

$$u = A_i \begin{bmatrix} x \\ 1 \end{bmatrix}.$$

Therefore, the segmentation of 2-D affine motions can be solved as in the 
case of feature points, after replacing $x_2 = u$ and $x_1 = x$. 
3 3-D Motion Segmentation 



3.1 Segmentation of 3-D Translational Motions: Clustering Hyperplanes in $\mathbb{R}^3$ 



The case of feature points. In this case, we assume that the scene can be modeled as 
a mixture of purely translational motion models, $\{T_i \in \mathbb{R}^3\}_{i=1}^n$, where $T_i$ represents 
the translation (calibrated case) or the epipole (uncalibrated case) of object $i$ relative to 
the camera between the two frames. A solution to this problem based on polynomial 
factorization was proposed in [12]. Here we present a much simpler solution based on 
polynomial differentiation. 

Given the images $x_1 \in \mathbb{P}^2$ and $x_2 \in \mathbb{P}^2$ of a point in object $i$ in the first and second 
frame, they must satisfy the well-known epipolar constraint for linear motions 

$$-x_2^T [T_i]_\times x_1 = T_i^T (x_2 \times x_1) = T_i^T \boldsymbol{\ell} = 0, \tag{22}$$

where $\boldsymbol{\ell} = (x_2 \times x_1) \in \mathbb{R}^3$ is known as the epipolar line associated with the image 
pair $(x_1, x_2)$. Therefore, the segmentation of 3-D translational motions is equivalent to 
clustering data (epipolar lines) lying on a collection of hyperplanes in $\mathbb{R}^3$ whose normal 
vectors are the $n$ epipoles $\{T_i\}_{i=1}^n$. As such, we can apply Algorithm 1 with $K = 3$ to 
$N \ge M_n(3) - 1 \sim O(n^2)$ epipolar lines $\{\boldsymbol{\ell}^j = x_2^j \times x_1^j\}_{j=1}^N$, with at least two epipolar 
lines per motion, to estimate the epipoles $\{T_i\}_{i=1}^n$ from the derivatives of the polynomial 
$p_n(\boldsymbol{\ell}) = (T_1^T \boldsymbol{\ell}) \cdots (T_n^T \boldsymbol{\ell})$. The only difference is that in this case the last entry of each 
epipole is not constrained to be equal to one. Therefore, when choosing the points $y_i$ in 
equation (13) we should take $\Pi = I$ not to eliminate the last coordinate. We therefore 
compute the epipoles up to an unknown scale factor as 

$$T_i = Dp_n(y_i) / \|Dp_n(y_i)\|, \quad i = 1, \ldots, n, \tag{23}$$

where the unknown scale is lost under perspective projection. 
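
The epipolar constraint for a purely translating point can be verified numerically. The sketch below is an illustration under the assumption of a calibrated camera with image points obtained by dividing by depth; the helper names `project` and `epipolar_line` are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(X):
    """Perspective projection of a 3-D point to homogeneous image coordinates."""
    return [X[0] / X[2], X[1] / X[2], 1.0]


def epipolar_line(x1, x2):
    """Epipolar line l = x2 x x1 associated with the image pair (x1, x2)."""
    return cross(x2, x1)


if __name__ == "__main__":
    X = [1.0, 2.0, 5.0]                       # 3-D point in the first frame
    T = [0.3, -0.2, 0.1]                      # translation of the object
    x1 = project(X)
    x2 = project([X[k] + T[k] for k in range(3)])
    line = epipolar_line(x1, x2)
    print(dot(T, line))                       # zero up to rounding: T^T l = 0
    print(dot([1.0, 0.0, 0.0], line))         # nonzero for a different translation
```

Since an epipolar line is only defined up to scale, the sign convention of the cross product does not affect the constraint.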



The case of optical flow. In the case of optical flow generated by purely translating 
objects we have $u^T [T_i]_\times x = 0$, where $u$ is interpreted as a three vector $[u, v, 0]^T \in \mathbb{R}^3$. 
Thus, one can estimate the translations $\{T_i \in \mathbb{R}^3\}_{i=1}^n$ as before by replacing $x_2 = u$ 
and $x_1 = x$. 



3.2 Segmentation of 3-D Rigid-Body Motions: Clustering Quadratic Forms in $\mathbb{R}^{3 \times 3}$ 

Assume that the motion of the objects relative to the camera between the two views can 
be modeled as a mixture of 3-D rigid-body motions $\{(R_i, T_i) \in SE(3)\}_{i=1}^n$ which are 
represented with a nonzero rank-2 fundamental matrix $F_i$. A solution to this problem 
based on the factorization of bi-homogeneous polynomials was proposed in [14]. Here 
we present a much simpler solution based on taking derivatives of the so-called multibody 
epipolar constraint (see below), thus avoiding polynomial factorization. 

Given an image pair $(x_1, x_2)$, there exists a motion $i$ such that the following epipolar 
constraint is satisfied 

$$x_2^T F_i x_1 = 0. \tag{24}$$
Therefore, the following multibody epipolar constraint [14] must be satisfied by the 
number of independent motions $n$, the fundamental matrices $\{F_i\}_{i=1}^n$, and the image 
pair $(x_1, x_2)$, regardless of the object to which the image pair belongs 

$$p_n(x_1, x_2) = \prod_{i=1}^n (x_2^T F_i x_1) = 0. \tag{25}$$

It was also shown in [14] that the multibody epipolar constraint can be written in bilinear 
form as $\nu_n(x_2)^T \mathcal{F} \nu_n(x_1) = 0$, where $\mathcal{F} \in \mathbb{R}^{M_n(3) \times M_n(3)}$ is the so-called multibody 
fundamental matrix, which can be linearly estimated from $N \ge M_n(3)^2 - 1 \sim O(n^4)$ 
image pairs in general position with at least 8 pairs corresponding to each motion. 

We now present a new solution to the problem of estimating the fundamental matrices 
$\{F_i\}_{i=1}^n$ from the multibody fundamental matrix $\mathcal{F}$ based on taking derivatives of the 
multibody epipolar constraint. Recall that, given a point $x_1 \in \mathbb{P}^2$ in the first image 
frame, the epipolar lines associated with it are defined as $\boldsymbol{\ell}_i = F_i x_1 \in \mathbb{R}^3$, $i = 1, \ldots, n$. 
Therefore, if the image pair $(x_1, x_2)$ corresponds to motion $i$, i.e. if $x_2^T F_i x_1 = 0$, then 

$$\frac{\partial}{\partial x_2} \left( \nu_n(x_2)^T \mathcal{F} \nu_n(x_1) \right) = \sum_{i=1}^n \prod_{\ell \neq i} (x_2^T F_\ell x_1)(F_i x_1) = \prod_{\ell \neq i} (x_2^T F_\ell x_1)(F_i x_1) \sim \boldsymbol{\ell}_i. \tag{26}$$



In other words, the partial derivative of the multibody epipolar constraint with respect to 
$x_2$ evaluated at $(x_1, x_2)$ is proportional to the epipolar line associated with $(x_1, x_2)$ in 
the second view.¹ Therefore, given a set of image pairs $\{(x_1^j, x_2^j)\}_{j=1}^N$ and the multibody 
fundamental matrix $\mathcal{F} \in \mathbb{R}^{M_n(3) \times M_n(3)}$, we can estimate a collection of epipolar lines 
$\{\boldsymbol{\ell}^j\}_{j=1}^N$. Remember from Section 3.1 that in the case of purely translating objects the 
epipolar lines were readily obtained as $x_1 \times x_2$. Here the calculation is more involved 
because of the rotational component of the rigid-body motions. Nevertheless, given a 
set of epipolar lines we can apply Algorithm 1 with $K = 3$ and $\Pi = I$ to estimate the 
$n$ epipoles $\{T_i\}_{i=1}^n$ up to a scale factor, as in equation (23). Therefore, if the $n$ epipoles 
are different,² then we can immediately compute the $n$ fundamental matrices $\{F_i\}_{i=1}^n$ 
by assigning the image pair $(x_1^j, x_2^j)$ to group $i$ if $i = \arg\min_{\ell = 1, \ldots, n} (T_\ell^T \boldsymbol{\ell}^j)^2$ and then 
applying the eight-point algorithm to the image pairs in group $i = 1, \ldots, n$. 
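
The grouping rule at the end of this paragraph amounts to a nearest-epipole assignment. A minimal sketch, assuming the epipoles and epipolar lines are already available as 3-vectors (the function name is hypothetical, and the eight-point step is not shown):

```python
def assign_to_epipoles(lines, epipoles):
    """Label each epipolar line l^j with the index i minimizing (T_i^T l^j)^2."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return [min(range(len(epipoles)), key=lambda i: dot(epipoles[i], line) ** 2)
            for line in lines]


if __name__ == "__main__":
    epipoles = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    lines = [(0.0, 2.0, 1.0), (3.0, 0.0, 1.0), (0.01, 1.0, 1.0)]
    print(assign_to_epipoles(lines, epipoles))  # indices into the epipole list
```

The image pairs that share a label would then be passed to the eight-point algorithm, one group at a time.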



3.3 Segmentation of 3-D Homographies: Clustering Quadratic Forms in $\mathbb{C}^{2 \times 3}$ 

The motion segmentation scheme described in the previous section assumes that the 
displacement of each object between the two views relative to the camera is nonzero, 
i.e. $T_i \neq 0$, otherwise the individual fundamental matrices are zero. Furthermore, it also 

¹ Similarly, the partial derivative of the multibody epipolar constraint with respect to $x_1$ evaluated 
at $(x_1, x_2)$ is proportional to the epipolar line associated with $(x_1, x_2)$ in the first view. 

² Notice that this is not a strong assumption. If two individual fundamental matrices share the 
same (left) epipoles, one can consider the right epipoles (in the first image frame) instead, 
because it is extremely rare that two motions give rise to the same left and right epipoles. In 
fact, this happens only when the rotation axes of the two motions are equal to each other and 
parallel to the translation direction [14]. 
requires that the 3-D points be in general configuration, otherwise one cannot uniquely 
recover each fundamental matrix from its epipolar constraint. The latter case occurs, for 
example, in the case of planar structures, i.e. when the 3-D points lie on a plane [16]. 

Both in the case of purely rotating objects (relative to the camera) or in the case of 
a planar 3-D structure, the motion model between the two views $x_1 \in \mathbb{P}^2$ and $x_2 \in \mathbb{P}^2$ 
is described by a homography matrix $H \in \mathbb{R}^{3 \times 3}$ such that [16] 

$$x_2 \sim H x_1 = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} x_1. \tag{27}$$



Consider now the case in which we are given a set of image pairs $\{(x_1^j, x_2^j)\}_{j=1}^N$ that 
can be modeled with $n$ independent homographies $\{H_i\}_{i=1}^n$ (see Remark 2). Note that 
the n homographies do not necessarily correspond to n different rigid-body motions. 
This is because it could be the case that one rigidly moving object consists of two or more 
planes, hence its rigid-body motion will lead to two or more homographies. Therefore, 
the n homographies can represent anything from 1 up to n rigid-body motions. In either 
case, it is evident from the form of equation (27) that we cannot take the product of 
all the equations, as we did with the epipolar constraints, because we have two linearly 
independent equations per image pair. Nevertheless, we show now that one can still solve 
the problem by working in the complex domain, as we describe below. 

We interpret the second image $x_2 \in \mathbb{P}^2$ as a point in $\mathbb{CP}$ by considering the first two
coordinates in $x_2$ as a complex number and appending a one to it. However, we still
think of $x_1$ as a point in $\mathbb{P}^2$. With this interpretation, we can rewrite (27) as

$$x_2 \sim H x_1 = \begin{bmatrix} h_{11} + h_{21}\sqrt{-1} & h_{12} + h_{22}\sqrt{-1} & h_{13} + h_{23}\sqrt{-1} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} x_1, \qquad (28)$$



where $H \in \mathbb{C}^{2\times 3}$ now represents a complex homography³. Let $w_2$ be the vector in $\mathbb{CP}$
perpendicular to $x_2$, i.e. if $x_2 = (z, 1)$ then $w_2 = (1, -z)$. Then we can rewrite (28) as
the following complex bilinear constraint

$$w_2^T H x_1 = 0, \qquad (29)$$

which we call the complex homography constraint.

3 Strictly speaking, we embed each real homography matrix into an affine complex matrix.

We can therefore interpret the motion segmentation problem as one in which we are given
image data $\{x_1^j \in \mathbb{P}^2\}_{j=1}^N$ and $\{w_2^j \in \mathbb{CP}\}_{j=1}^N$ generated by a collection of $n$ complex
homographies $\{H_i \in \mathbb{C}^{2\times 3}\}_{i=1}^n$. Then each image pair $(x_1, w_2)$ has to satisfy the
multibody homography constraint

$$\prod_{i=1}^{n} (w_2^T H_i x_1) = \nu_n(w_2)^T \mathcal{H}\, \nu_n(x_1) = 0, \qquad (30)$$

regardless of which one of the $n$ complex homographies is associated with the image
pair. We call the matrix $\mathcal{H} \in \mathbb{C}^{M_n(2)\times M_n(3)}$ the multibody homography. Now, since the
multibody homography constraint (30) is linear in the multibody homography $\mathcal{H}$, we
can linearly solve for $\mathcal{H}$ from (30) given $N \ge M_n(2)M_n(3) - (M_n(3) + 1)/2 \sim O(n^3)$
image pairs in general position⁴ with at least 4 pairs per moving object.
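
The complexification in (28), the constraint (29) and the counting bound above can be checked numerically. The following is only a minimal sketch in plain Python: the helper names and the test matrix are hypothetical, and the formula $M_n(K) = \binom{n+K-1}{K-1}$ (the number of degree-$n$ monomials in $K$ variables) is assumed here, since $M_n$ is defined outside this section.

```python
from math import comb

def complexify(H):
    """Fold rows 1 and 2 of a real 3x3 homography into one complex row, as in (28)."""
    top = [complex(H[0][j], H[1][j]) for j in range(3)]
    bottom = [complex(H[2][j], 0.0) for j in range(3)]
    return [top, bottom]

def homography_constraint(Hc, x1, x2):
    """Evaluate w2^T Hc x1 for x2 = (u, v, 1), with z = u + iv and w2 = (1, -z)."""
    z = complex(x2[0], x2[1])
    row = [sum(Hc[r][j] * x1[j] for j in range(3)) for r in range(2)]
    return row[0] - z * row[1]

def veronese_dim(n, K):
    """Assumed M_n(K): number of degree-n monomials in K variables."""
    return comb(n + K - 1, K - 1)

def min_image_pairs(n):
    """Lower bound M_n(2) M_n(3) - (M_n(3) + 1)/2 on the number of image pairs."""
    m2, m3 = veronese_dim(n, 2), veronese_dim(n, 3)
    return m2 * m3 - (m3 + 1) / 2

# Hypothetical homography and point, used only to exercise the constraint.
H = [[1.0, 2.0, 3.0], [0.0, 1.0, 4.0], [1.0, 0.0, 1.0]]
x1 = [1.0, 2.0, 1.0]
Hx1 = [sum(H[r][j] * x1[j] for j in range(3)) for r in range(3)]
x2 = [Hx1[0] / Hx1[2], Hx1[1] / Hx1[2], 1.0]   # second view, last coordinate set to 1
residual = homography_constraint(complexify(H), x1, x2)
```

For a noise-free pair the residual vanishes, and for $n = 1$ the bound reduces to the familiar four point correspondences per homography.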

Given the multibody homography $\mathcal{H} \in \mathbb{C}^{M_n(2)\times M_n(3)}$, the rest of the problem is
to recover the individual homographies $\{H_i\}_{i=1}^n$. In the case of fundamental matrices
discussed in Section 3.2, the key for solving the problem was the fact that fundamental
matrices are of rank 2, hence one can cluster epipolar lines based on the epipoles. In
principle, we cannot do the same with real homographies $H_i \in \mathbb{R}^{3\times 3}$, because in general
they are full rank. However, if we work with complex homographies $H_i \in \mathbb{C}^{2\times 3}$ they
automatically have a right null space which we call the complex epipole $e_i \in \mathbb{C}^3$. Then,
similarly to (26), we can associate a complex epipolar line

$$\ell^j \sim \left.\frac{\partial\, \nu_n(w_2)^T \mathcal{H}\, \nu_n(x_1)}{\partial x_1}\right|_{(x_1, w_2) = (x_1^j, w_2^j)} \in \mathbb{CP}^2 \qquad (31)$$

with each image pair $(x_1^j, w_2^j)$. Given this set of $N \ge M_n(3) - 1$ complex epipolar
lines $\{\ell^j\}_{j=1}^N$, with at least 2 lines per moving object, we can apply Algorithm 1 with
$K = 3$ and $\Pi = I$ to estimate the $n$ complex epipoles $\{e_i \in \mathbb{C}^3\}_{i=1}^n$ up to a scale
factor, as in equation (23). Therefore, if the $n$ complex epipoles are different, we can
cluster the original image measurements by assigning image pair $(x_1^j, x_2^j)$ to group $i$ if
$i = \arg\min_{k=1,\ldots,n} |e_k^T \ell^j|^2$. Once the image pairs have been clustered, the estimation
of each homography, either real or complex, becomes a simple linear problem.
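
The clustering rule can be illustrated on a toy example. The sketch below is not the multibody pipeline: it assumes the individual complex homographies are known, takes the complex epipole as the unconjugated cross product of the two rows of $H_i$, and uses $H_i^T w_2$ (the gradient of the single-motion constraint, to which the line in (31) is proportional for the correct group) as a stand-in for the complex epipolar line. All names and matrices are hypothetical.

```python
def cross(a, b):
    """Unconjugated cross product; the result r satisfies a.r = b.r = 0."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def complex_epipole(Hc):
    """Right null vector of a full-rank 2x3 complex homography."""
    return cross(Hc[0], Hc[1])

def line_single_motion(Hc, x2):
    """Hc^T w2 with w2 = (1, -z): gradient of w2^T Hc x with respect to x."""
    z = complex(x2[0], x2[1])
    return [Hc[0][j] - z * Hc[1][j] for j in range(3)]

def assign_group(line, epipoles):
    """Index i minimizing |e_i^T line|^2."""
    costs = [abs(sum(e[j] * line[j] for j in range(3))) ** 2 for e in epipoles]
    return min(range(len(costs)), key=costs.__getitem__)

# Two hypothetical complex homographies (already complexified as in (28)).
Hc1 = [[1, 1j, 2 + 3j], [0, 0, 1]]   # from H1 = [[1,0,2],[0,1,3],[0,0,1]]
Hc2 = [[1, 1j, 0], [1, 0, 1]]        # from H2 = [[1,0,0],[0,1,0],[1,0,1]]
epipoles = [complex_epipole(Hc1), complex_epipole(Hc2)]

# x1 = (1,1,1) maps to (3,4) under H1 and to (0.5,0.5) under H2.
labels = [assign_group(line_single_motion(Hc1, (3.0, 4.0)), epipoles),
          assign_group(line_single_motion(Hc2, (0.5, 0.5)), epipoles)]
```

In the actual method the epipoles would come from Algorithm 1 and the lines from the derivative of the multibody constraint; only the final assignment step is the same.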

Remark 1 (Direct extraction of homographies from $\mathcal{H}$). There is yet another way to
obtain the individual $H_i$ from $\mathcal{H}$, without segmenting the image pairs first. Once the complex
epipoles $e_i$ are known, one can compute the following linear combination of the rows
of $H_i$ (up to scale) from the derivatives of the multibody homography constraint at $e_i$:

$$w^T H_i \sim \left.\frac{\partial\, \nu_n(w)^T \mathcal{H}\, \nu_n(x)}{\partial x}\right|_{x = e_i} \in \mathbb{CP}^2, \quad \forall w \in \mathbb{C}^2. \qquad (32)$$

In particular, if we take $w = [1, 0]^T$ and $w = [0, 1]^T$ we obtain the first and second row
of $H_i$ (up to scale), respectively. By choosing additional $w$'s one obtains more linear
combinations from which the rows of $H_i$ can be linearly and uniquely determined.



Remark 2 (Independent homographies). The above solution assumes that the complex
epipoles are different (up to a scale factor). We take this assumption as our definition of
independent homographies, even though it is more restrictive than saying that the real
homographies $H_i \in \mathbb{R}^{3\times 3}$ are different (up to a scale factor). However, one can show
that, under mild conditions, e.g., the third rows of each $H_i$ are different, the null spaces
of the complex homographies are indeed different for different real homographies.⁵

4 The multibody homography constraint gives two equations per image pair, and there are
$(M_n(2) - 1)M_n(3)$ complex entries in $\mathcal{H}$ and $M_n(3)$ real entries (the last row).

5 The set of complex homographies that share the same null space is a five-dimensional subset
(hence a zero-measure subset) of all real homography matrices. Furthermore, one can
complexify any other two rows of $H$ instead of the first two. As long as two homography matrices
are different, one of the complexifications will give different complex epipoles.

Remark 3 (One rigid-body motion versus multiple ones). A homography is generally
of the form $H = R + T\pi^T$ where $\pi$ is the plane normal. If the homographies come
from different planes (different $\pi$) undergoing the same rigid-body motion, the proposed
scheme would work just fine since different normal vectors $\pi$ will cause the complex
epipoles to be different. However, if multiple planes with the same normal vector $\pi =
[0, 0, 1]^T$ undergo pure translational motions of the form $T_i = [T_{xi}, T_{yi}, T_{zi}]^T$, then
all the complex epipoles are equal to $e_i = [\sqrt{-1}, -1, 0]^T$. To avoid this problem, one
can complexify the first and third rows of $H$ instead of the first two. The new complex
epipoles are $e_i = [T_{xi} + T_{zi}\sqrt{-1},\, T_{yi},\, -1]^T$, which are different for different translations.

4 Experiments on Real and Synthetic Images 



2-D translational. We tested our polynomial differentiation algorithm (PDA) by seg- 
menting 12 frames of a sequence consisting of an aerial view of two robots moving on 
the ground. The robots are purposely moving slowly, so that it is harder to distinguish the 
flow from the noise. At each frame, we applied Algorithm 1 with $K = 2$ and $\delta = 0.02$
to the optical flow⁶ of all $N = 240 \times 352$ pixels in the image and segmented the image
measurements into n = 3 translational motion models. The leftmost column of Figure 1 
displays the x and y coordinates of the optical flow for frames 4 and 10, showing that 
it is not so simple to distinguish the three clusters from the raw data. The remaining 
columns of Figure 1 show the segmentation of the image pixels. The motion of the two 
robots and that of the background are correctly segmented. We also applied Algorithm 1 
to the optical flow of the flower garden sequence. Figure 2 shows the optical flow of one 
frame and the segmentation of the pixels into three groups: the tree, the grass, and the 
background. Notice that the boundaries of the tree can be assigned to any group, and in 
this case they are grouped with the grass. 




6 We compute optical flow using Black’s code at 
http://www.cs.brown.edu/people/black/ignc.html. 

3-D translational motions. Figure 3(a) shows the first frame of a 320 × 240 video
sequence containing a truck and a car undergoing two 3-D translational motions. We
applied Algorithm 1 with $K = 3$, $\Pi = I$ and $\delta = 0.02$ to the (real) epipolar lines
obtained from a total of $N = 92$ features, 44 in the truck and 48 in the car. The algorithm
obtained a perfect segmentation of the features, as shown in Figure 3(b), and estimated
the epipoles with an error of 5.9° for the truck and 1.7° for the car. We also tested
the performance of PDA on synthetic data corrupted with zero-mean Gaussian noise
with standard deviation between 0 and 1 pixels for an image size of 500 × 500 pixels. For
comparison purposes, we also implemented the polynomial factorization algorithm (PFA) of [12]
and a variation of the Expectation Maximization algorithm (EM) for clustering hyperplanes
in $\mathbb{R}^3$. Figures 3(c) and (d) show the performance of all the algorithms as a function of
the level of noise for $n = 2$ moving objects. The performance measures are the mean
error between the estimated and the true epipoles (in degrees), and the mean percentage
of correctly segmented features using 1000 trials for each level of noise. Notice that
PDA gives an error of less than 1.3° and a classification performance of over 96%.
Thus PDA gives approximately 1/3 the error of PFA, and improves the
classification performance by about 2%. Notice also that EM with the normal vectors
initialized at random (EM) yields a nonzero error in the noise-free case, because it
frequently converges to a local minimum. In fact, PDA outperforms EM.
However, if we use PDA to initialize EM (PDA+EM), the performance of both EM and
PDA improves, showing that our algorithm can be effectively used to initialize iterative
approaches to motion segmentation. Furthermore, PDA+EM needs approximately 50% of the
iterations of randomly initialized EM, hence there is also a gain
in computing time. Figures 3(e) and (f) show the performance of PDA as a function of
the number of moving objects for different levels of noise. As expected, the performance
deteriorates with the number of moving objects, though the translation error is still below
8° and the percentage of correct classification is over 78%.

Fig. 3. Segmenting 3-D translational motions by clustering planes in $\mathbb{R}^3$. Panels: (a) first frame;
(b) feature segmentation; (c) translation error, $n = 2$; (d) % of correct classification, $n = 2$;
(e) translation error, $n = 1, \ldots, 4$; (f) % of correct classification, $n = 1, \ldots, 4$. Left: segmenting a real
sequence with 2 moving objects. Center: comparing our algorithm with PFA and EM as a function
of noise in the image features. Right: performance of PFA as a function of the number of motions



3-D homographies. Figure 4(a) shows the first frame of a 2048 × 1536 video sequence
with two moving objects: a cube and a checkerboard. Notice that although there are only
two rigid motions, the scene contains three different homographies, one for
each of the visible planar structures. Furthermore, notice that the top side of
the cube and the checkerboard have approximately the same normals. We manually
tracked a total of $N = 147$ features: 98 in the cube (49 in each of the two visible sides)
and 49 in the checkerboard. We applied our algorithm in Section 3.3 with $\Pi = I$ and
$\delta = 0.02$ to segment the image data and obtained 97% correct classification, as
shown in Figure 4(b). We then added zero-mean Gaussian noise with standard deviation
between 0 and 1 pixels to the features, after rectifying the features in the second view in
order to simulate the noise-free case. Figure 4(c) shows the mean percentage of correct
classification for 1000 trials per level of noise. The percentage of correct classification
of our algorithm is between 80% and 100%, which gives a very good initial estimate for
any of the existing iterative/optimization/EM based motion segmentation schemes.





Fig. 4. Segmenting 3-D homographies by clustering complex bilinear forms in $\mathbb{C}^{2\times 3}$

5 Conclusions

We have presented a unified algebraic approach to 2-D and 3-D motion segmentation 
from feature correspondences or optical flow. Contrary to extant methods, our approach 
does not iterate between feature segmentation and motion estimation. Instead, it com- 
putes a single multibody motion model that is satisfied by all the image measurements 
and then extracts the original motion models from the derivatives of the multibody one. 
Various experiments showed that our algorithm not only outperforms existing algebraic
methods with much more limited applicability, but also provides a good initialization for
iterative techniques, such as EM, which are strongly dependent on correct initialization.

Abstract. Particle filters provide a means to track the state of an object
even when the dynamics and the observations are non-linear/non-Gaussian.
However, they can be very inefficient when the observation
noise is low as compared to the system noise, as is often the case in
visual tracking applications. In this paper we propose a new two-stage
sampling procedure to boost the performance of particle filters under
this condition. We provide conditions under which the new procedure is
proven to reduce the variance of the weights. Synthetic and real-world
visual tracking experiments are used to confirm the validity of the
theoretical analysis.



1 Introduction 

In this paper we consider particle filters in the special case when the observation 
noise is low as compared to the noise in the system’s dynamics (for brevity we 
call the latter noise the ‘system noise’). This is a typical situation in computer 
vision where the discrimination power of the object model is typically high. Such 
models may e.g. use shape, contour, colour, intensity information or a combina- 
tion of these and give rise to a highly peaked, low entropy observation likelihood 
function. Highly discriminative observation likelihoods are very desirable as they
result in highly peaked posteriors and hence, in theory, the position of the object
can then be estimated with high precision. 

In practice the posterior cannot be obtained in a closed form, except in a few 
special cases. Thus in general one must revert to some approximate method to 
estimate the posterior. Particle filters represent a rich class of such approximate 
methods. They represent the posterior using a weighted particle set living in 
the state space of the process. Upon the receipt of a new observation the generic 
particle filter algorithm updates the position of the particles and recomputes the 
weights so that the new weighted sample becomes a good representation of the 
new posterior that takes into account the new observation, as well. The positions
of the particles are typically updated independently of each other by drawing
them from a user-chosen proposal distribution. If the new particles are not in the
close vicinity of the modes of the likelihood function and the likelihood function
is highly peaked then the representation of the posterior will degrade very fast.
The peakier the observation likelihood function, the more important it is to sample
the particles such that they will be close to the modes.

This paper introduces a method that draws the new positions of the particles
using a two-stage sampling process that depends on the likelihood function, and
hence one expects the new method to perform better than
algorithms that do not use the likelihood function. At the core of the new
filter is a sampling method that is applicable when the density to sample from
has a product form, with one of the terms being highly peaked, whilst the other
has heavy tails. Under these conditions the new sampling method represents
an alternative to importance sampling. The new method works by first drawing
a particle from the broader density. In the second step this particle is perturbed 
such that on average it moves closer to one of the modes of the peaky term. 
Weights are calculated so that unbiasedness of the new sample is guaranteed. 
We compare the expected performance of the proposed scheme by means of a 
theoretical analysis to that of the basic importance sampling scheme and derive 
conditions under which the new scheme can be expected to perform better. The 
comparison is extended to the particle filtering setting, as well. The theoretical 
findings are confirmed in some computer experiments. In particular, the method 
is applied to the tracking of Japanese license plates where the new algorithm is 
shown to improve performance substantially both in terms of tracking accuracy 
and speed. 



1.1 Related Work 

The efficiency problem associated with low observation noise is well known in
the literature and hence many approaches exist to resolving it. Among the many
methods the Auxiliary variable Sampling Importance Resampling (ASIR) filter
introduced by Pitt and Shephard [1] is one of the closest to our algorithm. ASIR
approximates the proposal density using a mixture of the form $\sum_{k=1}^{N} \beta_k \gamma_k(\cdot)$
where the weight $\beta_k$ approximates the normalized likelihood of the new observation
($Y_t$) assuming that the state of the process at the $(t-1)$th time step is $X_{t-1}^{(k)}$.
The function $\gamma_k(\cdot)$ approximates the density $p(x_t | X_{t-1}^{(k)}, Y_1, \ldots, Y_t)$.¹ The biggest
obstacle in applying ASIR is to obtain a good approximation of the conditional
likelihood of the new observation given $X_{t-1}^{(k)}$, since this involves the evaluation
of a potentially high dimensional integral and the efficiency of ASIR ultimately
depends on the quality of this approximation. It turns out that except for some
special cases it is not easy to come up with a sampling scheme that could sample
from $p(x_t | X_{t-1}^{(k)}, Y_1, \ldots, Y_t) \propto p(Y_t | x_t)\, p(x_t | X_{t-1}^{(k)})$ in an efficient manner. This
issue is the problem we are addressing in this paper.

Another recent proposal is called ‘likelihood sampling’ (see e.g. [2]). In this
approach it is the likelihood function $p(Y_t | \cdot)$ that is used to generate the samples,
and the prediction density is used to calculate the weights. The success
of this method depends on the amount of ‘state aliasing’ that comes from the
limitations of the observation model. For multi-modal observation likelihoods a
large number of particles can be generated that will have low weights when the
posterior already concentrates on a small part of the state space. Our method
overcomes this problem by first sampling from the prediction density. As a result,
our method can suffer from inefficiency if the estimated posterior is considerably
off from the target, so we expect our method to be competitive only when this
is not the case.

1 Here and in what follows random variables are denoted by capital letters.

Perhaps the most relevant to this work is the LS-N-IPS algorithm introduced 
in [3]. In LS-N-IPS the prediction density is used to derive the new particle set 
which is then locally modified by climbing the observation likelihood. Hence 
this algorithm introduces some bias and also needs an efficient method to climb 
the observation likelihood. The method proposed here resolves the inefficiency 
problem without introducing any additional bias or requiring a hill-climbing 
method. 



2 Notation 

The following notations will be used: for an integrable function $f$, $I(f)$ will
denote the integral of $f$ with respect to the Lebesgue measure.² $\mathbb{R}^d$ denotes the
$d$-dimensional Euclidean space. $L^p$ ($0 < p \le +\infty$) denotes the set of functions
with finite $p$-norm. The $p$-norm of a function is denoted by $\|f\|_p$. For a function
$f \in L^s(\mathbb{R}^d)$, $s \in \{1, 2\}$, $\hat f$ denotes its Fourier transform: $\hat f(\omega) = \int e^{-i\omega^T x} f(x)\,dx$.
The inner product over $L^2(\mathbb{R}^d)$ is defined by $\langle f, g \rangle = \int f(x)\overline{g(x)}\,dx$,
where $\bar a$ denotes the complex conjugate of $a$. Convolution is denoted by $*$:
$(f * g)(x) = \int f(y)g(x - y)\,dy$. Expectation is denoted by $E$ and variance by $\mathrm{Var}$, as
usual.



3 Random Representation of Functional Products 

Our main interest is to generate random samples that can be used to represent
products with two terms $f$, $p$, where $f$ is an integrable function ($f$ plays the role
of the observation likelihood) and $p$ is a density (the prediction density). We
begin with the definition of what we mean by a properly weighted set w.r.t. $f$
and $p$. This definition is a slightly modified version of the definition given in [4]:

Definition 1 A random variable $X$ is said to be properly weighted by the function
$w$ with respect to the density $p$ and the integrable function $f$ if for any
integrable function $h$, $E[h(X)w(X)] = \int h(x)f(x)p(x)\,dx$. Also, in this case
$(X, w(X))$ is said to form a properly weighted pair with respect to $f, p$.
A set of random draws and weight functions $\{(X_j, w_j)\}_{j=1,\ldots,N}$ is said to
be properly weighted with respect to $p$ and $f$ if all components of the set are
properly weighted with respect to $p$ and $f$.

2 The underlying domain of the functions is not important at this point. It could be
any Polish space, e.g. a Euclidean space.

It should be clear that if $\{(X_j, w_j)\}_{j=1,\ldots,N}$ is a properly weighted set with
respect to $p$ and $f$ then, by the law of large numbers, the sample averages
$J_N(h, w) = (1/N)\sum_{j=1}^{N} h(X_j)w(X_j)$ will converge to $I(hfp)$ under fairly mild
conditions. Hence, in this sense, a properly weighted set with respect to $f$ and
$p$ can be thought of as representing the product density $f(\cdot)p(\cdot)$.

Let us now consider two constructions for properly weighted sets. It is clear 
from the definition that it is sufficient to deal with the case of a single random 
variable. 

The most obvious way to obtain a properly weighted pair is to sample $X$ from
$p$ and define $w = f$. Then, trivially $E[h(X)w(X)] = E[h(X)f(X)] = I(hfp)$.
We shall call this the canonical or basic sampling scheme.

A central question of Monte-Carlo sampling is how to obtain a properly
weighted set such that the variance of the estimate of $I(hfp)$ provided by the
sample average $J_N(h, w)$ is minimized. Actually, we are more interested in studying
the weight-normalized averages $I_N(h, w) = J_N(h, w)/J_N(1, w)$ that converge to
the normalized value $I(hfp)/I(fp)$ as $N \to \infty$ with probability one, under a
broad range of conditions on $(X_j, w_j)$. Obviously, the optimal sampling construction
depends on $h$. Since we are interested in the case when $h$ is not fixed,
it is sensible to use the “rule of thumb” presented in Liu [5] (based on [6])
to measure the efficiency of a sampling construction by a quantity inversely
proportional to the variance of $w(X)$, where $w$ is such that $E[w(X)] = 1$. By
straightforward calculations one can show that Liu’s measure still applies to our
case, because $E[w(X)] = I(fp) = \mathrm{const}$, independently of the choice of $X$ and
$w$. We take this rule as our starting point and will compare different sampling
schemes by the variance of the weights. Now, since for any properly weighted
pair $(X, w)$, $E[w(X)] = I(fp)$ and $\mathrm{Var}[w(X)] = E[w^2(X)] - E[w(X)]^2$, we find
that minimizing $\mathrm{Var}[w(X)]$ is equivalent to minimizing $E[w^2(X)]$. Note that if
$X$ is drawn from $p$ and $w$ is set to be equal to $f$, then $E[w^2(X)] = I(f^2 p)$.
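
The canonical scheme and the two quantities discussed above (the weight-normalized average and the second moment of the weights) can be sketched in a few lines of plain Python. The test problem, a standard normal $p$ and a narrow Gaussian-shaped $f$ centred at 1, is a hypothetical choice made only for illustration; for it the normalized value $I(hfp)/I(fp)$ with $h(x) = x$ equals $1/(1 + 0.1^2)$.

```python
import math
import random

def canonical_pairs(f, sample_p, n, rng):
    """Draw X_j ~ p and attach the weight w_j = f(X_j)."""
    xs = [sample_p(rng) for _ in range(n)]
    return [(x, f(x)) for x in xs]

def normalized_average(h, pairs):
    """Weight-normalized average: sum_j h(X_j) w_j / sum_j w_j."""
    total = sum(w for _, w in pairs)
    return sum(h(x) * w for x, w in pairs) / total

def weight_second_moment(pairs):
    """Monte-Carlo estimate of E[w^2]; equals I(f^2 p) for the canonical scheme."""
    return sum(w * w for _, w in pairs) / len(pairs)

# Hypothetical test problem: p = N(0, 1), f a narrow bump of width S centred at 1.
S = 0.1
f = lambda x: math.exp(-(x - 1.0) ** 2 / (2 * S * S))
sample_p = lambda rng: rng.gauss(0.0, 1.0)

rng = random.Random(0)
pairs = canonical_pairs(f, sample_p, 20000, rng)
posterior_mean = normalized_average(lambda x: x, pairs)
second_moment = weight_second_moment(pairs)
```

Most of the draws land where $f$ is essentially zero, which is the inefficiency that the scheme proposed next is meant to reduce.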

The sampling scheme we propose works by locally perturbing the samples
drawn from $p$ to move them closer to the modes of $f$. Let $g \in L^1$ be a compactly
supported function with $I(g) = 1$.

Locally Perturbed Sampling Procedure

1. Draw $N$ independent samples $X_1, \ldots, X_N$ from $p$.

2. For each $1 \le j \le N$, draw a sample $Z_j$ from the density $f(\cdot)\,g(X_j - \cdot)/(f * g)(X_j)$.

3. Calculate the weights $w_j = (f * g)(X_j)\,p(Z_j)/p(X_j)$ and output $\{(Z_j, w_j)\}$.

The algorithm first draws samples from $p$, just like the canonical one. In the
second step, the samples are ‘moved’ towards the modes of $f$, but stay in the
close vicinity of the drawn samples thanks to the compact support of $g$.³ Hence
the name of the procedure. In the next step, weights are calculated so that
$(Z_j, w_j)$ becomes a properly weighted pair for $f, p$.

3 A slight variant can be obtained by employing a two-variable kernel function $G(x, z)$
in place of $g(x - z)$. Then in the algorithm $(f * g)(X_j)$ is replaced by $\int f(z)G(X_j, z)\,dz$.

In the above algorithm the function $g$ could be a truncated Gaussian, or
the characteristic function of some convex set. In one extreme but practical
case $g$ equals the weighted sum of translated Dirac-delta functions: $g_m(x) =
\sum_{k=1}^{m} v_k\,\delta(x - t_k)$. We shall call such a function a Dirac-comb. When $g$ is a
Dirac-comb, the random sample $Z_j$ is drawn from the weighted discrete distribution
$\{(X_j + t_i, p_i)\}$, where $p_i \propto f(X_j + t_i)$.
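
For a Dirac-comb the three steps of the locally perturbed sampling procedure reduce to a discrete draw among shifted copies of the sample. The sketch below assumes an equal-weight comb with symmetric offsets (so that $g$ is even and $(f * g)(x)$ is the average of $f$ over the shifted points); the densities, the spacing 0.1 and all function names are hypothetical choices for illustration.

```python
import math
import random

def perturbed_pair(f, p, sample_p, offsets, rng):
    """One draw (Z, w) of the locally perturbed scheme with an equal-weight comb."""
    x = sample_p(rng)                               # step 1: X ~ p
    vals = [f(x + t) / len(offsets) for t in offsets]
    fg = sum(vals)                                  # (f * g)(X) for a symmetric comb
    if fg == 0.0:                                   # f vanishes (numerically) near X
        return x, 0.0
    z = x + rng.choices(offsets, weights=vals)[0]   # step 2: Z ~ f(.) g(X - .) / (f*g)(X)
    return z, fg * p(z) / p(x)                      # step 3: weight

# Hypothetical test problem: p = N(0, 1), f a narrow bump of width S centred at 1.
S = 0.1
f = lambda x: math.exp(-(x - 1.0) ** 2 / (2 * S * S))
p = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
sample_p = lambda rng: rng.gauss(0.0, 1.0)
offsets = [0.1 * k for k in range(-5, 6)]           # symmetric comb, spacing 0.1

rng = random.Random(0)
pairs = [perturbed_pair(f, p, sample_p, offsets, rng) for _ in range(20000)]
mean_weight = sum(w for _, w in pairs) / len(pairs)
posterior_mean = sum(z * w for z, w in pairs) / sum(w for _, w in pairs)
```

Under these assumptions the mean weight estimates $I(fp)$ (about 0.0607 for this test problem) and the weight-normalized average of $Z$ estimates $I(hfp)/I(fp)$ with $h(x) = x$, as proper weighting requires.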

The following proposition shows that the proposed sampling scheme results
in unbiased samples.⁴

Proposition 1 Assume that $g \in L^1$ is a compactly supported function satisfying
$g \ge 0$ and $I(g) = 1$. Let $f$ be a bounded, integrable function and let $p$ be a density.
Then, the above sampling procedure yields properly weighted pairs $(Z_j, w_j)$ with
respect to $f, p$.

The efficiency of the scheme will obviously depend on the correlation of $f$ and
$p$: if the modes of $p$ are far away from the modes of $f$ then the scheme will be
inefficient. Another source of inefficiency is when the support of $g$ is too small
to move the samples to the vicinity of the modes of $f$ or when it is too large.
In the limit when the support of $g$ grows to $\mathcal{X}$ the scheme reduces to likelihood
sampling. The following proposition provides the basic ground for the analysis
of the efficiency of this scheme.

Proposition 2 Assume that $g \in L^1$ is an even, compactly supported function
satisfying $g \ge 0$ and $I(g) = 1$ and let $f$ be a bounded, integrable, nonnegative
function and let $p$ be a density. Define the operator $A : L^1 \to L^\infty$ by

$$(Ah)(u) = \begin{cases} \int h(t)\,p(t)\,g(t - u)\left(\dfrac{p(t)}{p(u)} - 1\right) dt, & \text{when } p(u) > 0; \\ 0, & \text{otherwise.} \end{cases}$$

Assume that for some $s \in [1, \infty]$, $\varepsilon = \sup_{h \in L^1,\, h \ge 0} \sup_u (Ah)(u)/\|h\|_s < +\infty$.
Then the weights of the locally perturbed sampling procedure satisfy
$E[w_j^2] \le \langle f * g, (fp) * g \rangle + \varepsilon I(f)\|f\|_s$.

From this proposition it follows immediately that the proposed scheme is more
efficient than the canonical algorithm whenever $\varepsilon I(f)\|f\|_s < \langle f, fp \rangle - \langle f * g, (fp) * g \rangle$.
Clearly, this formula agrees well with our earlier intuition: the right hand
side is maximized when the cross-correlation of $f$ and $fp$ is high and the
cross-correlation of $f * g$ and $(fp) * g$ is small. In some cases convolution with $g$
can be thought of as a low-pass filtering operation (e.g. think of the case when $g$ is the
characteristic function of the unit interval) and hence $g$ cuts some of the high
frequencies of $f$ and $fp$. As a result, the cross-correlation of $f * g$ and $(fp) * g$ can
be expected to be smaller than the cross-correlation of $f$ and $fp$.

4 Here and in what follows the proofs are omitted due to a lack of space. An extended
version of the paper, available on the website of the authors, contains all the missing
proofs. The proof of this proposition uses Fubini’s theorem and $I(f * g) = I(f)$.

It remains to see, however, whether $A$ is bounded and $\varepsilon$ is well defined.
For this define $p_d(x) = \inf_{\|y\| \le d} p(x + y)$. Then, one can show that for $s = 1$,
$\varepsilon \le \|p\|_\infty \|p/p_d\|_\infty < +\infty$. Of course, this bound is not particularly tight and
much tighter bounds can be derived in special cases. For example when $g$ is
equal to the Dirac-comb defined earlier and if $\max_k \|t_k\| \le d$ then $(Af)(u) \le
\|f\|_\infty \|p\|_\infty \|p/p_d\|_\infty$. Therefore, in this case Proposition 2 holds with $s = +\infty$.
By means of some convexity arguments one may also derive bounds for mixture
densities. This can be useful to derive sharper bounds on $\varepsilon$.



4 Particle Filters Enhanced by Local Likelihood Sampling 

Let us now consider the problem of filtering a non-linear system of the form
$X_t = a(X_{t-1}) + W_t$, $Y_t = b(X_t) + V_t$, where $X_t \in \mathcal{X}$ is the state of the system⁵
at time $t$ and $Y_t \in \mathbb{R}^p$ is the observation at time $t$. We assume that $X_0 \sim p_0$. Here
$W_1, V_1, W_2, V_2, \ldots$ are independent, $W_1, W_2, \ldots$ are identically distributed, just
like $V_1, V_2, \ldots$. For the sake of simplicity, we further assume that the densities
$K(x | x') = p(X_t = x | X_{t-1} = x')$ and $f(y | x) = p(Y_t = y | X_t = x)$ exist. The
problem we consider is the estimation of the posterior $p(X_t | Y_{1:t})$, where $Y_{1:t}$
denotes the sequence of past observations: $Y_{1:t} = (Y_1, \ldots, Y_t)$.

Particle filters approximate the posterior by a random measure
$\pi_t(x) = \left(\sum_{k=1}^{N} w_t^{(k)} \delta(x - X_t^{(k)})\right) / \sum_{k=1}^{N} w_t^{(k)}$, where $X_t^{(k)}$ are called the particles, and
$w_t^{(k)}$ is the weight of the $k$th particle. $X_t^{(k)}, w_t^{(k)}$ are random quantities and
depend on the sequence of past observations $Y_{1:t}$. The best known particle filter is
probably the SIR⁶ filter [7], also known as CONDENSATION [8]:

SIR Filter

1. Draw $N$ independent samples $X_0^{(1)}, \ldots, X_0^{(N)}$ from $p_0$.

2. Repeat for $t = 1, 2, \ldots$:

a) Draw $\hat X_t^{(k)} \sim q(x_t | X_{t-1}^{(k)}, Y_t)$, $k = 1, \ldots, N$ independently of each other.

b) Calculate the weights
$w_t^{(k)} = f(Y_t | \hat X_t^{(k)})\, K(\hat X_t^{(k)} | X_{t-1}^{(k)}) / q(\hat X_t^{(k)} | X_{t-1}^{(k)}, Y_t)$.

c) Draw a sequence of independent indexes $j_1, \ldots, j_N$ such that $p(j_l = k) \propto
w_t^{(k)}$ and set $X_t^{(k)} = \hat X_t^{(j_k)}$. The corresponding weights are set to $1/N$.

SIR uses importance sampling to sample from $f(Y_t | \cdot)K(\cdot | X_{t-1}^{(k)})$, hence its
efficiency will depend on how well the proposal density $q$ matches the shape of
this function. A particularly popular choice for the proposal $q$ is the prediction
density $K$: $q(x_t | X_{t-1}^{(k)}, Y_t) = K(x_t | X_{t-1}^{(k)})$. In this case drawing from $q$ is
equivalent to simulating the dynamics of the system for a single time-step starting
from $X_{t-1}^{(k)}$. This is typically simple to implement, hence the popularity of this
choice. Also, in this case the weights become particularly simple to compute:
$w_t^{(k)} = f(Y_t | \hat X_t^{(k)})$. We shall call the SIR algorithm with this choice the “basic”
or “canonical” SIR algorithm.

5 Typically we will have $\mathcal{X} = \mathbb{R}^d$.

6 Sampling Importance Resampling
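
One time step of the canonical SIR filter (proposal equal to the prediction density) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation; the scalar linear-Gaussian model with observation noise of standard deviation 0.1 is a hypothetical example, for which the exact posterior mean after observing $y = 2$ is $2 \cdot 1.25/1.26 \approx 1.984$.

```python
import math
import random

def sir_step(particles, y, sample_K, f, rng):
    """One step of basic SIR: propose from K, weight by f(y|x), resample."""
    proposed = [sample_K(x_prev, rng) for x_prev in particles]      # step (a), q = K
    weights = [f(y, x) for x in proposed]                           # step (b)
    if sum(weights) == 0.0:                                         # all likelihoods underflowed
        return proposed
    return rng.choices(proposed, weights=weights, k=len(proposed))  # step (c)

# Hypothetical model: X_t = 0.5 X_{t-1} + W_t, W_t ~ N(0, 1); Y_t = X_t + V_t, V_t ~ N(0, 0.1^2).
sample_K = lambda x_prev, rng: 0.5 * x_prev + rng.gauss(0.0, 1.0)
f = lambda y, x: math.exp(-(y - x) ** 2 / (2 * 0.1 ** 2))

rng = random.Random(1)
particles = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # stand-in for draws from p_0
particles = sir_step(particles, 2.0, sample_K, f, rng)
estimate = sum(particles) / len(particles)
```

With such a peaked likelihood only a small fraction of the proposed particles receive non-negligible weight, which is why a large particle set is used in this example.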

Following the “rule of thumb” discussed in the previous section we shall
measure the efficiency of sampling at time step $t$ by $\mathrm{Var}[w_t^{(k)} | X_{t-1}^{(k)}, Y_{1:t}]$.

Our proposed new particle filter uses the method of the previous section to 
boost the performance of this algorithm in the case of low observation noise: 

Local Likelihood Sampling SIR (LLS-SIR)

1. Draw $N$ independent samples $X_0^{(1)}, \ldots, X_0^{(N)}$ from $p_0$.

2. Repeat for $t = 1, 2, \ldots$:

a) Draw $\hat X_t^{(k)} \sim K(\cdot | X_{t-1}^{(k)})$, $k = 1, \ldots, N$, independently of each other.

b) Draw $Z_t^{(k)} \sim f(Y_t | \cdot)\,g(\hat X_t^{(k)} - \cdot)/\alpha_t^{(k)}$ independently of each other, where
$\alpha_t^{(k)} = \int f(Y_t | x)\,g(\hat X_t^{(k)} - x)\,dx$.

c) Calculate the weights $w_t^{(k)} = \alpha_t^{(k)} K(Z_t^{(k)} | X_{t-1}^{(k)}) / K(\hat X_t^{(k)} | X_{t-1}^{(k)})$.

d) Resample from $\{(Z_t^{(k)}, p_t^{(k)})\}$ with $p_t^{(k)} \propto w_t^{(k)}$ just like it was done in SIR
to get the particles $X_t^{(k)}$. Set the weights uniformly to $1/N$.

The algorithm is identical to SIR except that the new particle positions are
determined using the two-stage sampling procedure introduced in the previous
section.
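
Steps (a) to (d) of LLS-SIR can be sketched for the Dirac-comb case as follows. As before this is only an illustration under stated assumptions: an equal-weight comb with symmetric offsets is used for $g$, the scalar linear-Gaussian model is hypothetical, and the transition density `K` is left unnormalized because only its ratio enters the weights.

```python
import math
import random

def lls_sir_step(particles, y, sample_K, K, f, offsets, rng):
    """One LLS-SIR step with an equal-weight, symmetric Dirac-comb g."""
    zs, ws = [], []
    for x_prev in particles:
        x = sample_K(x_prev, rng)                            # (a) predict
        vals = [f(y, x + t) / len(offsets) for t in offsets]
        alpha = sum(vals)                                    # alpha_t^(k)
        if alpha == 0.0:                                     # likelihood underflow near x
            zs.append(x)
            ws.append(0.0)
            continue
        z = x + rng.choices(offsets, weights=vals)[0]        # (b) local move
        zs.append(z)
        ws.append(alpha * K(z, x_prev) / K(x, x_prev))       # (c) weight
    if sum(ws) == 0.0:
        return zs
    return rng.choices(zs, weights=ws, k=len(zs))            # (d) resample

# Hypothetical model: X_t = 0.5 X_{t-1} + W_t, W_t ~ N(0, 1); Y_t = X_t + V_t, V_t ~ N(0, 0.1^2).
sample_K = lambda x_prev, rng: 0.5 * x_prev + rng.gauss(0.0, 1.0)
K = lambda x, x_prev: math.exp(-(x - 0.5 * x_prev) ** 2 / 2)   # unnormalized; only ratios are used
f = lambda y, x: math.exp(-(y - x) ** 2 / (2 * 0.1 ** 2))
offsets = [0.1 * k for k in range(-5, 6)]

rng = random.Random(1)
particles = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # stand-in for draws from p_0
particles = lls_sir_step(particles, 2.0, sample_K, K, f, offsets, rng)
estimate = sum(particles) / len(particles)
```

Each particle costs $m = 11$ likelihood evaluations here instead of one, which is the overhead that the variance comparison in the next section normalizes for.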

The following proposition shows the 1-step unbiasedness of the algorithm:

Proposition 3 Assume that $g$ is a non-negative, compactly supported,
integrable function satisfying $I(g) = 1$. Then LLS-SIR does not introduce any more
bias than the SIR algorithm in the sense that for any integrable function $h$ one
has

$$E[w_t^{(k)} h(Z_t^{(k)}) \,|\, Y_{1:t}] = E[h(X_t) \,|\, Y_{1:t}]\; p(Y_t | Y_{1:t-1}).$$

As a consequence of this proposition, convergence results analogous to those that
are known for the SIR algorithm can be derived.

Now, we compare the efficiency of the proposed algorithm with that of the
basic SIR algorithm. We shall focus on the case when $g$ is a Dirac-comb since
this choice allows one to implement the filter for continuous state spaces, which
is the case that we are particularly interested in.



5 Variance Analysis: The Case of the Dirac-Comb 

In this section for the sake of simplicity we consider one-dimensional systems
only. Note that these results extend to multi-dimensional systems without any
problems.⁸ We shall consider the choice $g_m(x) = \frac{1}{m}\sum_{l=-(m-1)/2}^{(m-1)/2} \delta(x - l\lambda)$, where
$m, \lambda > 0$.

7 Here $g$ is a compactly supported, integrable, nonnegative function satisfying $I(g) = 1$
as before.

The following theorem expresses the difference between the appropriately 
normalized variance of basic SIR and that of LLS-SIR. The normalization is 
intended to compensate for the m-times larger number of likelihood calculations 
required by LLS-SIR. The estimate developed here shows that LLS-SIR is more 
advantageous than SIR when started from the same particle set. 

Theorem 1 Assume that both the basic SIR and the proposed Local Likeli- 
hood Sampling algorithm each draw x[ k ^ from where xf; k \ is a 

common sample from a density p(-) « p(-|Yi :t _i). Let w[ k \siR) and Wf k \P) 
denote the (unnormalized) weights of the basic SIR and the Local Likelihood 
Sampling algorithm, respectively. Let A = j^(Va,r[w^ k \SIR)\X^ k } 1 ,Yi :t ] — 
^Var^k^P)!^^, Yi ; t] . Let e,s > 0 be defined as in Proposition 2, where 
in the definition of the operator A one uses p(-) = K(- Let /(•) = f(Y t |-). 
Then NA > ((/, fp) - (f * g, ( fp)*g )) - (^Var p [/] + el(f)\\f \\ s ) . Hence, 
the proposed sampling scheme is more efficient than the one used by SIR pro- 
vided that 



$$\langle f, p \rangle^2 > \langle f * g, (fp) * g \rangle + \varepsilon I(f) \|f\|_s. \qquad (1)$$

Let us now specialize (1) to the case when g equals to the equidistant 
Dirac-comb defined earlier. By using harmonic analysis arguments, one gets 
m/2(f * g m , fp * g m } 1/(2? x)I(f)I(fp), as m -> oo. Hence, (/ * g m , 

fp*g m ) ~ ~^I(f)I(fp)- Hence, condition 1 can be approximated by < f,p > 2 > 
A^I(f)I(fp) + e m /(/)||/||oo- Here, we have used e m instead of e in order to em- 
phasize the dependency of e on m. In general e m may (and often will) diverge to 
infinity asm-> oo. 9 As a result we get a tradeoff as a function of e m . In general 
one expects that when condition (1) is satisfied then it will be satisfied for an 
interval of values of m. 

6 Experiments and Results 

6.1 Simulation 

We have simulated the system
$x_t = x_{t-1}/2 + 25 x_{t-1}/(1 + x_{t-1}^2) + 8\cos(1.2t) + W_t$,
$y_t = |x_t|/20 + V_t$, where $W_t \sim N(0, 10)$ and $V_t \sim N(0, 2)$. 10 LLS-SIR
was implemented by using a Gaussian kernel
$G_\sigma(x, z) \propto \exp(-(\mathrm{sgn}(x) x^2/20 - z^2/20)^2 / (2\sigma^2))$.
In this case $f(Y_t|z) G(X_t, z)$ becomes a Gaussian in $z^2$ and hence
one can use importance sampling to sample Z from the corresponding density.

8 Note, however, that for high-dimensional state spaces more efficient schemes are 
needed. One such scheme is given in Section 6. 

9 Note that when the state space X is compact then $\varepsilon_m$ stays bounded.

10 This system is a slight variant of a system that has been used earlier in a number
of papers; in particular, in the observation function we used $|x_t|$ instead of $x_t^2$.

The system was simulated for T = 60 time steps and we have measured the
performance in terms of the average RMSE of $|x_t|$. The number of particles was
N = 50. The RMSE results obtained are 2.12 and 0.94 for SIR and LLS-SIR, 
respectively, whilst LLS-SIR took 1.1 times more time. Thus, in this case LLS- 
SIR is slightly more expensive than SIR, but this may well pay off in its increased 
accuracy. 
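
The simulation setup can be sketched as follows. This is only an illustration of the benchmark and of a baseline bootstrap SIR filter, not the code used for the reported numbers. It reads the observation equation as y_t = |x_t|/20 + V_t, treats 10 and 2 as noise variances, and starts the state and all particles at 0; all three are assumptions, as are the function names.

```python
import math
import random


def transition_mean(x, t):
    """Deterministic part of the state equation at time t."""
    return x / 2 + 25 * x / (1 + x * x) + 8 * math.cos(1.2 * t)


def simulate(T=60, seed=0):
    """Simulate states and observations; 10 and 2 are treated as variances."""
    rng = random.Random(seed)
    x = 0.0  # assumed initial state; the text does not specify it
    xs, ys = [], []
    for t in range(1, T + 1):
        x = transition_mean(x, t) + rng.gauss(0.0, math.sqrt(10.0))
        xs.append(x)
        ys.append(abs(x) / 20 + rng.gauss(0.0, math.sqrt(2.0)))
    return xs, ys


def sir_rmse(xs, ys, n_particles=50, seed=1):
    """Bootstrap SIR estimate of |x_t| and its RMSE against the simulated truth."""
    rng = random.Random(seed)
    particles = [0.0] * n_particles
    sq_err = 0.0
    for t, (x_true, y) in enumerate(zip(xs, ys), start=1):
        # Predict with the transition prior.
        particles = [transition_mean(p, t) + rng.gauss(0.0, math.sqrt(10.0))
                     for p in particles]
        # Weight by the Gaussian observation likelihood (variance 2).
        weights = [math.exp(-0.5 * (y - abs(p) / 20) ** 2 / 2.0) for p in particles]
        total = sum(weights)
        estimate = sum(w * abs(p) for w, p in zip(weights, particles)) / total
        sq_err += (estimate - abs(x_true)) ** 2
        # Resample; the new particle set is unweighted.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return math.sqrt(sq_err / len(xs))
```

Calling `sir_rmse(*simulate())` gives one RMSE sample for the baseline; averaging over many seeds would be needed before comparing against the figures quoted above.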



6.2 Visual Tracking 

The proposed algorithm was tested and compared with basic SIR on the problem 
of tracking Japanese license plates. In these experiments, the outline of a license- 
plate was taken as a parallelogram with two vertical lines. 11 

Japanese license plates enjoy a very specific geometrical structure, see Fig- 
ure 1. This gives the basic idea of the observation model. The observation model 
is scale-free and the likelihood is expressed as the product of the likelihoods of 
the parts of the license plate where the parts are looked for at the precise loca- 
tions as dictated by the geometry of the license plates: For each designated area 
of a candidate plate we compute the likelihood of the observed pattern assuming 
that the area is of the “right type”. The calculation of these likelihoods is
implemented in an ad hoc manner using simple image processing operations that rely
on measuring frequency content in the spatial domain. Based on a larger sample
of images we have found that the likelihood is sufficiently specific to this kind
of license plate. An example image is included in Figure 3.




Fig. 1. The model of a Japanese license plate. Checked means “thick line” area, dotted 
means “thin line” area, dashed line means “clear” area, solid line means “edge” 



The object dynamics is a mixture of an initial distribution $p_0$ and a simple
AR(2) product-process: in each time step with a constant probability we assume 
that the license plate reappears at a random position unrelated to the previous 
position. The probability of this event is set to a small value (0.1 in the experi- 
ments described here). Separate AR(2) models were used to evolve the position, 
scale, orientation and aspect ratio of the plate, each one independently of the 
others. The parameters of the AR(2) model were tuned by hand by conducting 
short preliminary experiments. 
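
A minimal sketch of these dynamics for a single scalar state component (for example, one position coordinate) is given below. The AR(2) coefficients, the noise level, the uniform choice for $p_0$, and the decision to restart the AR(2) history after a reappearance are all hypothetical; the paper only states that the parameters were tuned by hand.

```python
import random


def sample_next(x_prev, x_prev2, rng, p_reappear=0.1, a1=1.8, a2=-0.8,
                noise_std=1.0, p0_range=(0.0, 640.0)):
    """Draw one scalar state component from the mixture dynamics.

    Returns (x_new, x_prev_new) so the caller can carry the AR(2) history.
    a1, a2, noise_std and the uniform p0 are illustrative assumptions.
    """
    if rng.random() < p_reappear:
        # Reappearance: draw from the initial distribution, unrelated to the past.
        x_new = rng.uniform(*p0_range)
        return x_new, x_new  # assumption: the history restarts at the new value
    # AR(2) step: x_t = a1 * x_{t-1} + a2 * x_{t-2} + noise.
    x_new = a1 * x_prev + a2 * x_prev2 + rng.gauss(0.0, noise_std)
    return x_new, x_prev
```

Because position, scale, orientation and aspect ratio evolve independently, a full state update would call such a function once per component, each with its own parameters.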

11 Thus the configuration of a license plate on the image can be defined with 5
parameters (assuming a fixed aspect ratio).

LLS was implemented only for the position of the plates. Denoting positions
by $(x_1, x_2)$ and by $p(x_1, x_2)$ the corresponding weights used in the LLS step, we
used $p(x_1, x_2) = p(x_1 | x_2)\, p(x_2)$ in the sampling step as follows. We first sample
$x_2$ from $p(x_2)$, and then $x_1$ from $p(\cdot | x_2)$. The distribution
$p(x_2) = \sum_{x_1} p(x_1, x_2)$ was
approximated by making the corresponding observation likelihoods insensitive
to the precise horizontal location. The search length was half of the size of the
predicted plate size, in both directions.



Results. The performance of the local sampling algorithm with N = 100 samples
was compared to the performance of the basic SIR filter using N = 750
particles. The number of particles for LLS-SIR was estimated in advance so that we
expected the two algorithms to have roughly equivalent running times. It turned
out that both algorithms were capable of running faster than real-time. In
particular, on our 1.7GHz Intel test machine we have measured a processing speed of
approximately 48 frames per second for the basic SIR algorithm, whilst for
LLS-SIR the measured processing speed was approximately 65 frames per second,
i.e., we have slightly overestimated the resource requirements of LLS-SIR.

Performance evaluation was done as follows: We selected a test video sequence
that consisted of 298 frames. In each time step particle locations were averaged
to get the final guess of the license plate position. This position was compared
to the “ground truth” obtained by running basic SIR for the test video sequence
with N = 10,000 particles and then correcting the results manually. Some
frames of this sequence are shown in Figure 2. To be able to judge the difficulty of
the tracking task, Figure 3 shows the observation likelihood function of a selected
frame. On this image the intensity of a pixel is proportional to the logarithm
of the maximum of the observation likelihood, where the maximum is taken over
plate configurations with the center of gravity of plates matching the pixel’s
position, the orientation matching the best orientation, but keeping the scale of
the plates free.

Fig. 2. Sample images of the test video sequence. The video sequence is recorded by
a commercial NTSC camera. The frame indexes of the images are 9, 29, 82, 105, 117 and
125. The plate positions predicted by LLS-SIR are projected back on the image

P. Torma and Cs. Szepesvari

Fig. 3. Log-likelihood of a selected frame of the test video sequence. Note that pixel
intensities are taken for the maximum of the logarithm of the observation likelihood
where scale is kept free. For more information see the text

Fig. 4. Histogram of the probability of not tracking the object. Note the log-scales of
the axes

Define the distance of two license plate configurations as the sum of distances
of their corresponding vertex points. If this distance is larger than one third of
the license plate height then the license plate is considered to be “lost”. The
probability of this event was estimated for each frame by running n =
100 Monte-Carlo experiments. Figure 4 shows the histogram of the probabilities
of object loss on the test sequence. Note the log-scale of both axes. The percentage
of frames in which LLS-SIR tracks the plates (i.e., it never loses the plate in any
of the experiments) is over 94%. The corresponding number is 77% for SIR.
Tracking error was measured for those frames in which the object was tracked by
the respective algorithms. The median tracking error is 1.23 pixels and 5.36
pixels for LLS-SIR and SIR, respectively. The corresponding means are 3.62 and
5.29, the standard deviations are 4.88 and 5.91. Thus, we conclude that in this
case LLS-SIR is more efficient than the basic SIR algorithm both in terms of
execution speed and tracking performance.
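
The loss criterion and the per-frame loss probability can be written down directly. In the sketch below a plate configuration is represented as a list of four (x, y) vertices in a fixed order, and the plate height is passed in explicitly; both representation choices and the function names are assumptions made for illustration.

```python
import math


def configuration_distance(plate_a, plate_b):
    """Sum of Euclidean distances between corresponding vertices."""
    return sum(math.dist(p, q) for p, q in zip(plate_a, plate_b))


def is_lost(estimate, ground_truth, plate_height):
    """A plate counts as lost if the distance exceeds one third of its height."""
    return configuration_distance(estimate, ground_truth) > plate_height / 3.0


def loss_probability(estimates, ground_truth, plate_height):
    """Fraction of Monte-Carlo runs in which the plate is lost on one frame."""
    lost = sum(is_lost(e, ground_truth, plate_height) for e in estimates)
    return lost / len(estimates)
```

For example, shifting all four vertices of a plate by one pixel gives a distance of 4, which counts as a loss for a plate that is 9 pixels high but not for one that is 15 pixels high.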

7 Conclusions 

We have proposed a new algorithm, LLS-SIR, to enhance particle filters in the low
observation noise limit. The algorithm is a modification of the standard particle
filter algorithm in which, after the prediction step, the positions of the particles are
randomly resampled from the localized observation density. Theoretical analysis
revealed that the scheme does not introduce any bias as compared to the basic
SIR algorithm. It was also shown that the new algorithm achieves a higher
effective sample size than the basic one when the observations are reliable. This
results in a better tracking performance, as was illustrated on a synthetic and
a real world tracking problem.

Further work shall include a more thorough evaluation of the proposed al- 
gorithm and more comparisons with competing algorithms. On the theoretical 
side, extending previous uniform convergence results to the new algorithm looks 
like an interesting challenge. Another important avenue of research is to extend 
the results of Section 5 so that one can compare the long term behaviour of the 
various algorithms. Derivation of lower bounds on the tracking accuracy could 
be another important next step, too.

Abstract. The problem of tracking a varying number of non-rigid objects
has two major difficulties. First, the observation models and target
distributions can be highly non-linear and non-Gaussian. Second, the 
presence of a large, varying number of objects creates complex inter- 
actions with overlap and ambiguities. To surmount these difficulties, we 
introduce a vision system that is capable of learning, detecting and track- 
ing the objects of interest. The system is demonstrated in the context of 
tracking hockey players using video sequences. Our approach combines 
the strengths of two successful algorithms: mixture particle filters and 
Adaboost. The mixture particle filter [17] is ideally suited to multi-target 
tracking as it assigns a mixture component to each player. The crucial 
design issues in mixture particle filters are the choice of the proposal 
distribution and the treatment of objects leaving and entering the scene. 
Here, we construct the proposal distribution using a mixture model that 
incorporates information from the dynamic models of each player and the 
detection hypotheses generated by Adaboost. The learned Adaboost pro- 
posal distribution allows us to quickly detect players entering the scene, 
while the filtering process enables us to keep track of the individual play- 
ers. The result of interleaving Adaboost with mixture particle filters is a 
simple, yet powerful and fully automatic multiple object tracking system. 



1 Introduction 

Automated tracking of multiple objects is still an open problem in many settings, 
including car surveillance [10], sports [12,13] and smart rooms [6] among many 
others [5,7,11]. In general, the problem of tracking visual features in complex 
environments is fraught with uncertainty [6]. It is therefore essential to adopt 
principled probabilistic models with the capability of learning and detecting the 
objects of interest. In this work, we introduce such models to attack the problem 
of tracking a varying number of hockey players on a sequence of digitized video 
from TV. 

Over the last few years, particle filters, also known as condensation or se- 
quential Monte Carlo, have proved to be powerful tools for image tracking [3, 
8,14,15]. The strength of these methods lies in their simplicity, flexibility, and 
systematic treatment of nonlinearity and non-Gaussianity.

Various researchers have attempted to extend particle filters to multi-target
tracking. Among others, Hue et al. [5] developed a system for multitarget tracking
by expanding the state dimension to include component information,
assigned by a Gibbs sampler. They assumed a fixed number of objects. To manage
a varying number of objects efficiently, it is important to have an automatic 
detection process. The Bayesian Multiple-BLob tracker (BraMBLe) [7] is an 
important step in this direction. BraMBLe has an automatic object detection 
system that relies on modeling a fixed background. It uses this model to identify 
foreground objects (targets). In this paper, we will relax this assumption of a 
fixed background in order to deal with realistic TV video sequences, where the 
background changes. 

As pointed out in [17], particle filters may perform poorly when the posterior 
is multi-modal as the result of ambiguities or multiple targets. To circumvent this 
problem, Vermaak et al. introduce a mixture particle filter (MPF), where each
component (mode or, in our case, hockey player) is modelled with an individual 
particle filter that forms part of the mixture. The filters in the mixture interact 
only through the computation of the importance weights. By distributing the 
resampling step to individual filters, one avoids the well known problem of sample 
depletion, which is largely responsible for loss of track [17]. 

In this paper, we extend the approach of Vermaak et al. In particular, we use 
a cascaded Adaboost algorithm [18] to learn models of the hockey players. These 
detection models are used to guide the particle filter. The proposal distribution 
consists of a probabilistic mixture model that incorporates information from 
Adaboost and the dynamic models of the individual players. This enables us to 
quickly detect and track players in a dynamically changing background, despite 
the fact that the players enter and leave the scene frequently. We call the resulting 
algorithm the Boosted Particle Filter (BPF). 

2 Statistical Model 

In non-Gaussian state-space models, the state sequence $\{x_t; t \in \mathbb{N}\}$,
$x_t \in \mathbb{R}^{n_x}$, is assumed to be an unobserved (hidden) Markov process
with initial distribution $p(x_0)$ and transition distribution $p(x_t | x_{t-1})$,
where $n_x$ is the dimension of the state vector. In our tracking system, this
transition model corresponds to a standard autoregressive dynamic model. The
observations $\{y_t; t \in \mathbb{N}^*\}$, $y_t \in \mathbb{R}^{n_y}$, are
conditionally independent given the process $\{x_t; t \in \mathbb{N}\}$ with marginal
distribution $p(y_t | x_t)$, where $n_y$ is the dimension of the observation vector.



2.1 Observation Model 

Following [14], we adopt a multi-color observation model based on
Hue-Saturation-Value (HSV) color histograms. Since HSV decouples the intensity
(i.e., value) from color (i.e., hue and saturation), it is reasonably insensitive
to illumination effects. An HSV histogram is composed of $N = N_h N_s + N_v$
bins and we denote $b_t(d) \in \{1, \ldots, N\}$ as the bin index associated with the
color vector $y_t(k)$ at a pixel location d at time t. Figure 1 shows two instances
of the color histogram. If we define the candidate region in which we formulate
the HSV histogram as $R(x_t) = l_t + s_t W$, then a kernel density estimate
$K(x_t) = \{k(n; x_t)\}_{n=1,\ldots,N}$ of the color distribution at time t is given by [1,14]:

$$k(n; x_t) = \eta \sum_{d \in R(x_t)} \delta[b_t(d) - n] \qquad (1)$$

where $\delta$ is the delta function, $\eta$ is a normalizing constant which ensures k to
be a probability distribution, $\sum_{n=1}^{N} k(n; x_t) = 1$, and a location d could be any
pixel location within $R(x_t)$. Eq. (1) defines $k(n; x_t)$ as the probability of a color
bin n at time t.

Fig. 1. Color histograms: This figure shows two color histograms of selected
rectangular regions, each of which is from a different region of the image. The player
on the left has a uniform whose color is a combination of dark blue and white, and the
player on the right has a red uniform. One can clearly see concentrations of color bins
due to the limited number of colors. In (a) and (b), we set the number of bins N = 110,
where $N_h$, $N_s$, and $N_v$ are set to 10
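
Eq. (1) amounts to a normalized count of bin indices over the pixels of the candidate region. The sketch below assumes HSV components scaled to [0, 1] and uses 0-based bin indices. The rule that sends weakly saturated or dark pixels to the $N_v$ value-only bins, and the thresholds `s_min` and `v_min`, are assumptions about how the $N_h N_s + N_v$ bins are populated; they are not stated in this section.

```python
def bin_index(h, s, v, n_h=10, n_s=10, n_v=10, s_min=0.1, v_min=0.2):
    """Map an HSV pixel (components in [0, 1]) to a 0-based bin index.

    Pixels with enough saturation and value go to one of the n_h * n_s
    hue-saturation bins; the rest go to one of the n_v value-only bins.
    The thresholds are illustrative assumptions.
    """
    if s > s_min and v > v_min:
        hi = min(int(h * n_h), n_h - 1)
        si = min(int(s * n_s), n_s - 1)
        return hi * n_s + si
    vi = min(int(v * n_v), n_v - 1)
    return n_h * n_s + vi


def color_histogram(bins, n_bins=110):
    """Eq. (1): k(n) = eta * (number of pixels in the region whose bin is n)."""
    if not bins:
        raise ValueError("the candidate region contains no pixels")
    eta = 1.0 / len(bins)  # normalizing constant so that the histogram sums to 1
    hist = [0.0] * n_bins
    for b in bins:
        hist[b] += eta
    return hist
```

A region would be processed by calling `bin_index` on every pixel inside $R(x_t)$ and passing the resulting list to `color_histogram`.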

If we denote $K^* = \{k^*(n; x_0)\}_{n=1,\ldots,N}$ as the reference color model and
$K(x_t)$ as a candidate color model, then we need to measure the data likelihood
(i.e., similarity) between $K^*$ and $K(x_t)$. As in [1,14], we apply the Bhattacharyya
similarity coefficient to define a distance $\xi$ on HSV histograms. The mathematical
formulation of this measure is given by [1]:

$$\xi[K^*, K(x_t)] = \left[ 1 - \sum_{n=1}^{N} \sqrt{k^*(n; x_0)\, k(n; x_t)} \right]^{1/2} \qquad (2)$$

Statistical properties of near optimality and scale invariance presented in [1]
ensure that the Bhattacharyya coefficient is an appropriate choice for measuring
similarity of color histograms. Once we obtain a distance $\xi$ on the HSV color
histograms, we use the following likelihood distribution given by [14]:

$$p(y_t | x_t) \propto e^{-\lambda \xi^2[K^*, K(x_t)]} \qquad (3)$$

where $\lambda = 20$. This value of $\lambda$ is suggested in [14,17], and was also confirmed
in our experiments. Also, we set the number of bins $N_h$, $N_s$, and $N_v$ to 10.
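
Eqs. (2) and (3) translate into a few lines of code. The sketch below takes two histograms of equal length that each sum to 1 and returns the unnormalized likelihood; the function names are illustrative, and the clamp to zero only guards against floating-point round-off.

```python
import math


def bhattacharyya_distance(ref, cand):
    """Eq. (2): xi = sqrt(1 - sum_n sqrt(k*(n) * k(n)))."""
    coeff = sum(math.sqrt(p * q) for p, q in zip(ref, cand))
    return math.sqrt(max(0.0, 1.0 - coeff))


def color_likelihood(ref, cand, lam=20.0):
    """Eq. (3), up to a constant factor: exp(-lambda * xi^2)."""
    xi = bhattacharyya_distance(ref, cand)
    return math.exp(-lam * xi * xi)
```

Identical histograms give a distance of 0 and an unnormalized likelihood of 1, while histograms with disjoint support give a distance of 1 and a likelihood of exp(-20).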

The HSV color histogram is a reliable approximation of the color density
on the tracked region. However, a better approximation is obtained when we
consider the spatial layout of the color distribution. If we define the tracked
region as the sum of r sub-regions $R(x_t) = \sum_{j=1}^{r} R_j(x_t)$, then we apply the
likelihood as the sum of the reference histograms $\{k_j^*\}_{j=1,\ldots,r}$ associated with
each sub-region by [14]:

$$p(y_t | x_t) \propto e^{-\lambda \sum_{j=1}^{r} \xi^2[K_j^*, K_j(x_t)]} \qquad (4)$$

Eq. (4) shows how the spatial layout of the color is incorporated into the data
likelihood. In Figure 2, we divide up the tracked regions into two sub-regions in
order to use spatial information of the color in the appearance of a hockey player.




Fig. 2. Multi-part color likelihood model: This figure shows our multi-part color 

likelihood model. We divide our model into two sub-regions and take a color histogram 
from each sub-region so that we take into account the spatial layout of colors of two 
sub-regions 

For hockey players, their uniforms usually have a different color on their 
jacket and their pants and the spatial relationship of different colors becomes 
important. 

2.2 The Filtering Distribution 

We denote the state vectors and observation vectors up to time t by
$x_{0:t} = \{x_0 \ldots x_t\}$ and $y_{0:t}$. Given the observation and transition models,
the solution to the filtering problem is given by the following Bayesian recursion [3]:

$$p(x_t | y_{0:t}) = \frac{p(y_t | x_t)\, p(x_t | y_{0:t-1})}{p(y_t | y_{0:t-1})} = \frac{p(y_t | x_t) \int p(x_t | x_{t-1})\, p(x_{t-1} | y_{0:t-1})\, dx_{t-1}}{\int p(y_t | x_t)\, p(x_t | y_{0:t-1})\, dx_t} \qquad (5)$$

To deal with multiple targets, we adopt the mixture approach of [17].
The posterior distribution, $p(x_t | y_{0:t})$, is modelled as an M-component
non-parametric mixture model:

$$p(x_t | y_{0:t}) = \sum_{j=1}^{M} \pi_{j,t}\, p_j(x_t | y_{0:t}) \qquad (6)$$

where the mixture weights satisfy $\sum_{m=1}^{M} \pi_{m,t} = 1$. Using the filtering
distribution, $p_j(x_{t-1} | y_{0:t-1})$, computed in the previous step, the predictive
distribution becomes

$$p(x_t | y_{0:t-1}) = \sum_{j=1}^{M} \pi_{j,t-1}\, p_j(x_t | y_{0:t-1}) \qquad (7)$$

where $p_j(x_t | y_{0:t-1}) = \int p(x_t | x_{t-1})\, p_j(x_{t-1} | y_{0:t-1})\, dx_{t-1}$. Hence, the updated
posterior mixture takes the form

$$p(x_t | y_{0:t}) = \frac{\sum_{j=1}^{M} \pi_{j,t-1}\, p_j(y_t | x_t)\, p_j(x_t | y_{0:t-1})}{\sum_{k=1}^{M} \pi_{k,t-1} \int p_k(y_t | x_t)\, p_k(x_t | y_{0:t-1})\, dx_t}$$

$$= \sum_{j=1}^{M} \left[ \frac{\pi_{j,t-1} \int p_j(y_t | x_t)\, p_j(x_t | y_{0:t-1})\, dx_t}{\sum_{k=1}^{M} \pi_{k,t-1} \int p_k(y_t | x_t)\, p_k(x_t | y_{0:t-1})\, dx_t} \right] \left[ \frac{p_j(y_t | x_t)\, p_j(x_t | y_{0:t-1})}{\int p_j(y_t | x_t)\, p_j(x_t | y_{0:t-1})\, dx_t} \right]$$

$$= \sum_{j=1}^{M} \pi_{j,t}\, p_j(x_t | y_{0:t})$$

where the new weights (independent of $x_t$) are given by:

$$\pi_{j,t} = \frac{\pi_{j,t-1} \int p_j(y_t | x_t)\, p_j(x_t | y_{0:t-1})\, dx_t}{\sum_{k=1}^{M} \pi_{k,t-1} \int p_k(y_t | x_t)\, p_k(x_t | y_{0:t-1})\, dx_t} \qquad (8)$$



Unlike the mixture particle filter of [17], we have M different likelihood
distributions, $\{p_j(y_t | x_t)\}_{j=1 \ldots M}$. When one or more new objects appear in the
scene, they are detected by Adaboost and automatically initialized with an
observation model. Using a different color-based observation model allows us to track
different colored objects.
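
In a particle implementation the integrals in Eq. (8) are not available in closed form. One simple way to approximate them, assumed here for illustration only, is to average the likelihood $p_j(y_t | x_t)$ over the particles of component j after they have been propagated through the transition model, so that they are samples from $p_j(x_t | y_{0:t-1})$. The function name and the fallback for a vanishing total are hypothetical.

```python
def update_mixture_weights(prev_weights, component_likelihoods):
    """Approximate Eq. (8) from per-component particle likelihoods.

    prev_weights[j] is pi_{j,t-1}; component_likelihoods[j] holds p_j(y_t | x)
    evaluated at the predicted particles of component j.
    """
    # Monte Carlo estimate of each integral: the mean likelihood per component.
    evidences = [sum(liks) / len(liks) for liks in component_likelihoods]
    unnormalized = [pi * e for pi, e in zip(prev_weights, evidences)]
    total = sum(unnormalized)
    if total == 0.0:
        return list(prev_weights)  # assumption: keep old weights if all evidence vanishes
    return [u / total for u in unnormalized]
```

With equal previous weights and mean likelihoods of 0.3 and 0.1, the updated weights become 0.75 and 0.25, so the component that explains the current frame better gains mass.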



3 Particle Filtering 

In standard particle filtering, we approximate the posterior $p(x_t | y_{0:t})$ with a
Dirac measure using a finite set of N particles $\{x_t^i\}_{i=1 \ldots N}$. To accomplish
this, we sample candidate particles from an appropriate proposal distribution
$x_t^i \sim q(x_t | x_{0:t-1}, y_{0:t})$ (in the simplest scenario, it is set as
$q(x_t | x_{0:t-1}, y_{0:t}) = p(x_t | x_{t-1})$, yielding the bootstrap filter [3]) and
weight these particles according to the following importance ratio:

$$w_t^i = \frac{p(y_t | x_t^i)\, p(x_t^i | x_{t-1}^i)}{q(x_t^i | x_{0:t-1}^i, y_{0:t})} \qquad (9)$$

We resample the particles using their importance weights to generate an
unweighted approximation of $p(x_t | y_{0:t})$. In the mixture approach of [17], the
particles are used to obtain the following approximation of the posterior distribution:

$$p(x_t | y_{1:t}) \approx \sum_{j=1}^{M} \pi_{j,t} \sum_{i \in I_j} w_t^i\, \delta_{x_t^i}(x_t)$$

where $I_j$ is the set of indices of the particles belonging to the j-th mixture
component. As with many particle filters, the algorithm simply proceeds by
sampling from the transition priors (note that this proposal does not use the
information in the data) and updating the particles using importance weights
derived from equation (8); see Section 3 of [17] for details.

In [17], the mixture representation is obtained and maintained using a simple 
K-means spatial reclustering algorithm. In the following section, we argue that 
boosting provides a more satisfactory solution to this problem. 



4 Boosted Particle Filter 

The boosted particle filter introduces two important extensions of the MPF. 
First, it uses Adaboost in the construction of the proposal distribution. This 
improves the robustness of the algorithm substantially. It is widely accepted 
that proposal distributions that incorporate the recent observations (in our case, 
through the Adaboost detections) outperform naive transition prior proposals 
considerably [15,16]. Second, Adaboost provides a mechanism for obtaining and 
maintaining the mixture representation. This approach is again more powerful 
than the naive K-means clustering scheme used for this purpose in [17]. In par- 
ticular, it allows us to detect objects leaving and entering the scene efficiently. 

4.1 Adaboost Detection 

We adopt the cascaded Adaboost algorithm of Viola and Jones [18], originally
developed for detecting faces. In our experiments, a 23 layer cascaded classifier
is trained to detect hockey players. In order to train the detector, a total of 6000
figures of hockey players are used. These figures are scaled to have a resolution
of 10 x 24 pixels. In order to generate such data in a limited amount of time,
we speed the selection process by using a program to extract small regions of
the image that most likely contain hockey players. Simply, the program finds
regions that are centered on low intensities (i.e., hockey players) and surrounded
by high intensities (i.e., rink surface). However, it is important to note that the
data generated by such a simple script is not ideal for training Adaboost, as
shown in Figure 3. As a result, our trained Adaboost produces false positives
alongside the edge of the rink, as shown in (c) of Figure 4. More human intervention
with a larger training set would lead to better Adaboost results, although failures
would still be expected in regions of clutter and overlap. The non-hockey-player
subwindows used to train the detector are generated from 100 images manually
chosen to contain nothing but the hockey rink and audience. Since our tracker
is implemented for tracking hockey scenes, there is no need to include training
images from outside the hockey domain.

Fig. 3. Training set of images for hockey players: This figure shows a part of
the training data. A total of 6000 different figures of hockey players are used for the
training

Fig. 4. Hockey player detection result: This figure shows results of the Adaboost
hockey detector. (a) and (b) show mostly accurate detections. In (c), there is a set
of false positives detected on the audience by the rink

The results of using Adaboost in our dataset are shown in Figure 4. Adaboost 
performs well at detecting the players but often gets confused and leads to many 
false positives. 



4.2 Incorporating Adaboost in the Proposal Distribution 

It is clear from the Adaboost results that they could be improved if we considered
the motion models of the players. In particular, by considering plausible
motions, the number of false positives could be reduced. For this reason, we
incorporate Adaboost in the proposal mechanism of our MPF. The expression for
the proposal distribution is given by the following mixture:

$$q_B^*(x_t | x_{0:t-1}, y_{1:t}) = \alpha\, q_{ada}(x_t | x_{t-1}, y_t) + (1 - \alpha)\, p(x_t | x_{t-1}) \qquad (10)$$

where $q_{ada}$ is a Gaussian distribution that we discuss in the subsequent
paragraph (see Figure 5). The parameter $\alpha$ can be set dynamically without affecting
the convergence of the particle filter (it is only a parameter of the proposal
distribution and therefore its influence is corrected in the calculation of the
importance weights). When $\alpha = 0$, our algorithm reduces to the MPF of [17]. By
increasing $\alpha$ we place more importance on the Adaboost detections. We can
adapt the value of $\alpha$ depending on tracking situations, including cross overs,
collisions and occlusions.

Fig. 5. Mixture of Gaussians for the proposal distribution
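
The mixture proposal of Eq. (10) and the corresponding importance weight can be sketched for a one-dimensional state as follows. The sketch assumes that both the transition prior and $q_{ada}$ are Gaussians, centered at the predicted position and at the Adaboost detection respectively; the standard deviations, the default value of alpha, and the convention that a missing detection (`None`) forces alpha to 0 are illustrative assumptions.

```python
import math
import random


def gauss_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def effective_alpha(detection, alpha):
    """Without a detection the proposal falls back to the transition prior."""
    return 0.0 if detection is None else alpha


def sample_proposal(pred_mean, detection, rng, alpha=0.5, trans_std=5.0, ada_std=2.0):
    """Draw from alpha * q_ada + (1 - alpha) * p by first picking a component."""
    if rng.random() < effective_alpha(detection, alpha):
        return rng.gauss(detection, ada_std)
    return rng.gauss(pred_mean, trans_std)


def proposal_pdf(x, pred_mean, detection, alpha=0.5, trans_std=5.0, ada_std=2.0):
    """Density of the mixture proposal at x."""
    a = effective_alpha(detection, alpha)
    p_trans = gauss_pdf(x, pred_mean, trans_std)
    if a == 0.0:
        return p_trans
    return a * gauss_pdf(x, detection, ada_std) + (1.0 - a) * p_trans


def importance_weight(x, likelihood, pred_mean, detection, alpha=0.5,
                      trans_std=5.0, ada_std=2.0):
    """Likelihood times transition density, divided by the proposal density."""
    p_trans = gauss_pdf(x, pred_mean, trans_std)
    q = proposal_pdf(x, pred_mean, detection, alpha, trans_std, ada_std)
    return likelihood * p_trans / q
```

Because the proposal density appears in the denominator of the weight, changing alpha alters where particles are placed but not the distribution they target; with alpha equal to 0 the weight reduces to the likelihood, as in the bootstrap filter.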

Note that the Adaboost proposal mechanism depends on the current observation
$y_t$. It is, therefore, robust to peaked likelihoods. Still, there is another
critical issue to be discussed: determining how close two different proposal
distributions need to be for creating their mixture proposal. We can always apply
the mixture proposal when $q_{ada}$ is overlaid on a transition distribution modeled
by autoregressive state dynamics. However, if these two different distributions
do not overlap, there is a distance between the means of these distributions.

If a Monte Carlo estimation of a mixture component by a mixture particle
filter overlaps with the nearest cluster given by the Adaboost detection algorithm,
we sample from the mixture proposal distribution. If there is no overlap
between the Monte Carlo estimation of a mixture particle filter for each mixture
component and clusters given by the Adaboost detection, then we set $\alpha = 0$ so
that our proposal distribution takes only a transition distribution of a mixture
particle filter.

5 Experiments

This section shows BPF tracking results on hockey players in a digitized video 
sequence from broadcast television. The experiments are performed using our 
non-optimized implementation in C on a 2.8 GHz Pentium IV. 

Figure 6 shows scenes where objects are coming in and out. In this figure, it is 
important to note that no matter how many objects are in the scene, the mixture 
representation of BPF is not affected and successfully adapts to the change. For 
objects coming into the scene, in (a) of the figure, Adaboost quickly detects a 
new object in the scene within a short time sequence of only two frames. Then 
BPF immediately assigns particles to an object and starts tracking it. 

Figure 7 shows the Adaboost detection result in the left column, a frame
number in the middle, and BPF tracking results on the right. In Frames 32
and 33, a small crowd of three players in the middle of the center circle on
the rink is not well detected by Adaboost. This is an example of the case
in which Adaboost does not work well in clutter. There are only two windows
detected by Adaboost. However, BPF successfully tracks three modes in the
cluttered environment. Frames 34 and 35 show another case when Adaboost
fails. Since Adaboost features are based on different configurations of intensity
regions, it is highly sensitive to a drastic increase/decrease of the intensity. The
average intensity of the image is clearly changed by a camera flash in Frame 34.
The number of Adaboost detections is much smaller in comparison with the
other two consecutive frames. However, even in the case of an Adaboost failure,
mixture components are well maintained by BPF, which is also clearly shown in
Frame 205 in the figure.

Fig. 6. Objects appearing and disappearing from the scene: (a) Before a new
player appears; (b) 2 frames after; (c) Before two players disappear; (d) 8 frames
after. This is a demonstration of how well BPF handles objects randomly coming in
and out of the scene. (a) and (b) show that a new player that appears in the top left
corner of the image is successfully detected and starts to get tracked. (c) and (d) show
that two players are disappearing from the scene to the left

Fig. 7. BPF tracking result: The results of Adaboost detection are shown on the
left (a), with the corresponding boosted particle filter results on the right (b), for
frames 32, 33, 34, 35 and 205



6 Conclusions 

We have described an approach to combining the strengths of Adaboost for ob- 
ject detection with those of mixture particle filters for multiple-object tracking. 
The combination is achieved by forming the proposal distribution for the par- 
ticle filter from a mixture of the Adaboost detections in the current frame and 
the dynamic model predicted from the previous time step. The combination of 
the two approaches leads to fewer failures than either one on its own, as well as 
addressing both detection and consistent track formation in the same framework. 

We have experimented with this boosted particle filter in the context of 
tracking hockey players in video from broadcast television. The results show 
that most players are successfully detected and tracked, even as players move in 
and out of the scene. We believe results can be further improved by including 
more examples in our Adaboost training data for players that are seen against 
non-uniform backgrounds. Further improvements could be obtained by dynamic 
adjustment of the weighting parameter selecting between the Adaboost and dy- 
namic model components of the proposal distribution. Adopting a probabilistic 
model for target exclusion [11] may also improve our BPF tracking. 



Acknowledgment. This research is funded by the Institute for Robotics and 
Intelligent Systems (IRIS) and their support is gratefully acknowledged. The 
authors would like to specially thank Matthew Brown, Jesse Hoey, and Don 
Murray from the University of British Columbia for fruitful discussions and 
helpful comments about the formulation of BPF. 



A Boosted Particle Filter: Multitarget Detection and Tracking 






References 

1. Comaniciu, D., Ramesh, V., Meer, P.: Real-Time Tracking of Non-Rigid Objects 
using Mean Shift. IEEE Conference on Computer Vision and Pattern Recognition, 
pp. 142-151 (2000) 

2. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed 
particle filtering. IEEE Conference on Computer Vision and Pattern Recognition 
(2000) 

3. Doucet, A., de Freitas, J. F. G., Gordon, N. J. (eds.): Sequential Monte Carlo 
Methods in Practice. Springer-Verlag, New York (2001) 

4. Freund, Y., Schapire, R. E.: A decision-theoretic generalization of on-line learn- 
ing and an application to boosting. Computational Learning Theory, pp. 23-37, 
Springer-Verlag (1995) 

5. Hue, C., Le Cadre, J.-P., Perez, P.: Tracking Multiple Objects with Particle Fil- 
tering. IEEE Transactions on Aerospace and Electronic Systems, 38(3):791-812 
(2002) 

6. Intille, S. S., Davis, J. W., Bobick, A.F.: Real-Time Closed-World Tracking. IEEE 
Conference on Computer Vision and Pattern Recognition, pp. 697-703 (1997) 

7. Isard, M., MacCormick, J.: BraMBLe: A Bayesian multiple-blob tracker. 
International Conference on Computer Vision, pp. 34-41 (2001) 

8. Isard, M., Blake, A.: Condensation - conditional density propagation for visual 
tracking. International Journal of Computer Vision, 28(1):5-28 (1998) 

9. Kalman, R. E.: A New Approach to Linear Filtering and Prediction Problems. 
Transactions of the ASME - Journal of Basic Engineering, vol. 82, Series D, pp. 35-45 
(1960) 

10. Koller, D., Weber, J., Malik, J.: Robust Multiple Car Tracking with Occlusion 
Reasoning. European Conference on Computer Vision, pp. 186-196, LNCS 800, 
Springer-Verlag (1994) 

11. MacCormick, J., Blake, A.: A probabilistic exclusion principle for tracking multiple 
objects. International Conference on Computer Vision, pp. 572-578 (1999) 

12. Misu, T., Naemura, M., Zheng, W., Izumi, Y., Fukui, K.: Robust Tracking 
of Soccer Players Based on Data Fusion. IEEE 16th International Conference on 
Pattern Recognition, vol. 1, pp. 556-561 (2002) 

13. Needham, C. J., Boyle, R. D.: Tracking multiple sports players through occlusion, 
congestion and scale. British Machine Vision Conference, vol. 1, pp. 93-102 BMVA 
(2001) 

14. Perez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-Based Probabilistic Tracking. 
European Conference on Computer Vision, (2002) 

15. Rui, Y., Chen, Y.: Better Proposal Distributions: Object Tracking Using Unscented 
Particle Filter. IEEE Conference on Computer Vision and Pattern Recognition, pp. 
786-793 (2001) 

16. van der Merwe, R., Doucet, A., de Freitas, J. F. G., Wan, E.: The Unscented 
Particle Filter. Advances in Neural Information Processing Systems, vol. 8, 
pp. 351-357 (2000) 

17. Vermaak, J., Doucet, A., Perez, P.: Maintaining Multi-Modality through Mixture 
Tracking. International Conference on Computer Vision (2003) 

18. Viola, P., Jones, M.: Rapid Object Detection using a Boosted Cascade of Simple 
Features. IEEE Conference on Computer Vision and Pattern Recognition (2001) 



Simultaneous Object Recognition and 
Segmentation by Image Exploration* 



Vittorio Ferrari 1, Tinne Tuytelaars 2, and Luc Van Gool 1,2 

1 Computer Vision Group (BIWI), ETH Zuerich, Switzerland 
{ferrari, vangool}@vision.ee.ethz.ch 
2 ESAT-PSI, University of Leuven, Belgium 
Tinne.Tuytelaars@esat.kuleuven.ac.be 



Abstract. Methods based on local, viewpoint invariant features have 
proven capable of recognizing objects in spite of viewpoint changes, oc- 
clusion and clutter. However, these approaches fail when these factors 
are too strong, due to the limited repeatability and discriminative power 
of the features. As additional shortcomings, the objects need to be rigid 
and only their approximate location is found. We present a novel Object 
Recognition approach which overcomes these limitations. An initial set of 
feature correspondences is first generated. The method anchors on it and 
then gradually explores the surrounding area, trying to construct more 
and more matching features, increasingly farther from the initial ones. 
The resulting process covers the object with matches, and simultaneously 
separates the correct matches from the wrong ones. Hence, recognition 
and segmentation are achieved at the same time. Only very few correct 
initial matches suffice for reliable recognition. The experimental results 
demonstrate the stronger power of the presented method in dealing with 
extensive clutter, dominant occlusion, large scale and viewpoint changes. 
Moreover non-rigid deformations are explicitly taken into account, and 
the approximate contours of the object are produced. The approach 
can extend any viewpoint invariant feature extractor. 



1 Introduction 

Recently, object recognition (OR) approaches based on local invariant features 
have become increasingly popular [8, 5, 2, 4, 7]. Typically, local features are ex- 
tracted independently from both a model and a test image, then characterized 
by invariant descriptors and finally matched. The success of these approaches 
is twofold. First, the feature extraction process and description are viewpoint 
invariant. Secondly, local features bring tolerance to clutter and occlusion, de 
facto removing the need for prior segmentation. In this respect, global methods, 
both contour-based [9] and appearance-based [10], are a step behind. 

* This research was supported by EC project VIBES and the Fund for Scientific Re- 
search Flanders. 

In spite of their success, the robustness and generality of these approaches 
are limited by the repeatability of the feature extraction, and the difficulty of 
matching correctly, in the presence of large amounts of clutter and challenging 
viewing conditions. Large scale or viewpoint changes considerably lower the 
probability that any given model feature is re-extracted in the test image (e.g.: 
figure 2, left). Simultaneously, occlusion reduces the number of visible model fea- 
tures. The combined effect is that only a small fraction of model features has a 
correspondence in the test image. This fraction represents the maximal number 
of features that can be correctly matched. Unfortunately, at the same time ex- 
tensive clutter gives rise to a large number of non-object features, which disturb 
the matching process. As a final outcome of these combined difficulties, only a 
few, if any, correct matches are produced. Because these often come together 
with many mismatches, recognition tends to fail. 

Even in easier cases, to suit the needs for repeatability in spite of viewpoint 
changes, only a sparse set of distinguished features [7] is extracted. As a result, 
only a small portion of the object is typically covered with matches. Densely 
covering the visible part of the object is desirable, as it increases the evidence 
for its presence, which results in higher discriminative power. 

In this paper, we face these problems by no longer relying solely on matching 
viewpoint invariant features. Instead, we propose to anchor on an initial set 
thereof, and then look around them trying to construct more matching features. 
As new matches arise, they are exploited to construct even more, in a process 
which gradually explores the test image, recursively constructing more and more 
matches, increasingly farther from the initial ones. As the number and extent 
of matched features increases, so does the information available to judge their 
individual correctness. Gradually the system’s confidence in the presence of the 
object grows. 

We build upon a multi-scale extension of the affine invariant region extrac- 
tor of [2]. An initial large set of unreliable region correspondences is generated 
through a process tuned to maximize the amount of correct matches, at the 
cost of producing many mismatches (section 2). Additionally, we generate a grid 
of circular regions homogeneously covering the model image. The core of the 
method iteratively alternates between expansion phases, where correspondences 
for these coverage regions are constructed, and contraction phases, which at- 
tempt to remove mismatches. In the first expansion phase (section 3), we try to 
propagate the coverage regions based on the geometric transformation of nearby 
initial matches. By propagating a region, we mean constructing the correspond- 
ing one in the test image. The propagated matches and the initial ones are then 
passed through a novel local filter, during the first contraction phase (section 
4). The processing continues by alternating faster expansion phases (section 5), 
where coverage regions are propagated over a larger area, with contraction phases 
based on a global filter (section 6). The filter exploits both topological arrangements 
and appearance information, and tolerates non-rigid deformations. During 
the expansion phases, the shape of each new region is adapted to the local sur- 
face orientation, thus allowing the exploration process to follow curved surfaces 
and deformations (e.g. a folded magazine). At each iteration, the presence of the 
newly propagated matches helps the filter to take better removal decisions. In 
turn, the cleaner set of supports makes the next expansion more effective. As a 
result, the number and the percentage of correct matches grow at every iteration. 
The algorithm gradually gets a clearer idea about the object’s presence and location. 
The two closely cooperating processes of expansion and contraction gather more 
evidence about the presence of the object and separate correct matches from 
wrong ones at the same time. This results in the simultaneous recognition and 
segmentation of the object. By constructing matches for the coverage regions, 
the system also succeeds in covering image areas which are not interesting for 
the feature extractor or not discriminative enough to be correctly matched by 
traditional techniques. 

Fig. 1. Scheme of the system 

The basic advantage of the approach is that each single correct initial match 
can expand to cover a contiguous surface with many correct matches, even when 
starting from a large amount of mismatches. This leads to filling the visible 
portion of the object with matches. Some interesting direct advantages derive 
from it. First, robustness to scale, viewpoint, occlusion and clutter are greatly 
enhanced, because most cases where the traditional approach generated only a 
few correct matches are now solvable. Second, discriminative power is increased, 
because decisions about the object’s identity are based on information densely 
distributed over the entire portion of the object visible in the test image. Third, 
the approximate boundary of the object in the test image is directly suggested 
by the final set of matched regions (section 8). Fourth, non-rigid deformations 
are explicitly taken into account. 

2 Soft Matches 

The feature extraction algorithm [2] is applied to both a model image $I_m$ and a 
test image $I_t$ independently, producing two sets of regions $\Phi_m$, $\Phi_t$. 

Tentative Matches 

For each test region $T \in \Phi_t$ we compute the Mahalanobis distance of the 
invariant descriptors [2] to all model regions $M \in \Phi_m$. An appearance similarity 
measure $\mathrm{sim}(T, M)$ is computed between $T$ and each of the 10 closest regions. 
The measure is a linear combination of grey-level normalized cross-correlation 
(NCC) and the average Euclidean distance in RGB space, after geometric and 
photometric normalization. This mixture is more discriminant than NCC alone, 
while keeping invariance to brightness changes. We consider each of the 3 most 
similar regions above a low threshold $t_1$. Repeating this operation for all regions 
$T \in \Phi_t$ yields a first set of tentative matches. At this point, every test region 
could be matched to either none, 1, 2 or 3 model regions. 
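
As a rough illustration of this shortlist-then-threshold scheme, the sketch below keeps, for every test region, up to 3 model regions among the 10 nearest in descriptor space. It is a simplified sketch under stated assumptions: plain Euclidean distance stands in for the Mahalanobis distance, the similarity function `sim` is supplied by the caller, and the function and parameter names are hypothetical.

```python
import heapq
import math

def tentative_matches(test_desc, model_desc, sim, t1, n_closest=10, n_keep=3):
    """Return (test_index, model_index, similarity) triples.

    test_desc / model_desc: lists of descriptor tuples.
    sim(ti, mi): caller-supplied appearance similarity (assumption).
    """
    matches = []
    for ti, d_t in enumerate(test_desc):
        # Shortlist the n_closest model regions by descriptor distance.
        # Euclidean distance is used here instead of Mahalanobis.
        shortlist = heapq.nsmallest(
            n_closest, range(len(model_desc)),
            key=lambda mi: math.dist(d_t, model_desc[mi]))
        # Score the shortlist and keep only similarities above t1.
        scored = [(sim(ti, mi), mi) for mi in shortlist]
        scored = [(s, mi) for s, mi in scored if s > t1]
        # Keep at most n_keep best-scoring model regions (soft matches).
        for s, mi in heapq.nlargest(n_keep, scored):
            matches.append((ti, mi, s))
    return matches
```

A test region can therefore end up with zero, one, two or three matches, which is what allows soft matches to survive to the next stage.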

Refinement and Re-thresholding 

Since all regions are independently extracted from the two images, the geo- 
metric registration of a correct match might not be optimal, which lowers its 
similarity. The registration of the tentative matches is refined using our recently 
proposed algorithm [1], that efficiently looks for the affine transformation that 
maximizes the similarity. After refinement, the similarity is re-evaluated and only 
matches scoring above a second, higher threshold $t_2$ are kept. Refinement tends 
to raise the similarity of correct matches much more than that of mismatches. 
The increased separation between the similarity distributions makes the second 
thresholding more effective. 

The obtained set of matches usually still contains soft-matches, i.e. more than 
one region in $\Phi_m$ corresponding to the same region in $\Phi_t$, or vice-versa. This 
contrasts with classic matching methods [7,2,5,11,8], but there are two good reasons 
for it. First, the scene might contain repeated, or visually similar elements. Sec- 
ondly, large viewpoint and scale changes cause loss of resolution which results 
in a less accurate correspondence and a lower similarity. When there is also ex- 
tensive clutter, it might be impossible, based purely on local appearance [14], to 
decide which of the top-3-matches is correct, as several competing regions might 
appear very similar, and score higher than the correct match. 

The proposed process outputs a large set of plausible matches, all with a 
reasonably high similarity. The goal is to maximize the number of correct matches, 
even at the cost of accepting a substantial fraction of mismatches. In difficult 
cases this is important, as each correct match can start an expansion which will 
cover significant parts of the object. 

Fig. 2. Left: case-study (top: model image, bottom: test image). Middle: a closer view 
with 3 initial matches. The two model regions on the left are both matched to the same 
region in the test image. Note the small occluding rubber on the spoon. Right-top: the 
homogeneous coverage $\Omega$. Right-bottom: a support region (dark), associated sectors 
(lines) and candidates (bright) 

Figure 2 shows a case-study, for which 3 correct matches out of 217 are found 
(a correct-ratio of 3/217). The large scale change (factor 3.3), combined with the 
modest resolution (720x576), causes heavy image degradation which corrupts 
edges and texture. In such conditions only a few model regions are re-extracted 
and many mismatches are inevitable. In the remainder of the paper, we refer to 
the current set of matches as the configuration $\Gamma$. 

How to proceed? Global, robust geometry filtering methods, like detecting 
outliers to the epipolar geometry through RANSAC [3], fail, as they need a 
minimal amount of inliers of about 30% [8]. Initially, this may very well not be 
the case. Even if we could separate out the few correct matches, they would not 
be sufficient to draw reliable conclusions about the presence of the object. In the 
following we explain how to gradually increment the number of correct matches 
and simultaneously decrease the number of mismatches. 
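
To see why such a low inlier ratio is a problem for random-sampling estimators, the standard RANSAC trial-count formula $N = \log(1 - p) / \log(1 - w^s)$ can be evaluated for the case-study ratio of 3/217. The sketch below is illustrative only: the sample size of 7 (a seven-point fundamental matrix estimate) and the 99% confidence level are assumptions, not values given in the text.

```python
import math

def ransac_trials(inlier_ratio, sample_size, confidence=0.99):
    """Samples needed so that, with probability `confidence`,
    at least one random sample consists of inliers only."""
    p_good = inlier_ratio ** sample_size   # chance that one sample is all-inlier
    if p_good >= 1.0:
        return 1
    if p_good <= 0.0:
        return math.inf
    # log1p keeps precision when p_good is tiny.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_good))

# Assumed sample size of 7; inlier ratios from the text (30%) and the case-study (3/217).
trials_at_30_percent = ransac_trials(0.30, 7)
trials_case_study = ransac_trials(3 / 217, 7)
```

Under these assumptions the 30% case needs on the order of tens of thousands of samples, while the case-study ratio would need more than 10^12, which is far beyond practical reach.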

3 Early Expansion 

Coverage of the Model Image 

We generate a grid $\Omega$ of overlapping circular regions densely covering the model 
image $I_m$ (figure 2, top-right). The expansion phases will try to construct in $I_t$ 
as many regions corresponding to them as possible. 



Propagation Attempt 

We now define the concept of propagation attempt, which is the basic building 
block of the expansion phases and will be used later. Consider a region $C_m$ 
in the model image $I_m$ without a match in the test image $I_t$, and a nearby region 
$S_m$, matched to $S_t$. If $C_m$ and $S_m$ lie on the same physical facet of the object, 
they will be mapped to $I_t$ by similar affine transformations. The support match 
$(S_m, S_t)$ attempts to propagate the candidate region $C_m$ to $I_t$ as follows: 

1. Compute the affine transformation $A$ mapping $S_m$ to $S_t$. 

2. Project $C_m$ to $I_t$ via $A$: $C_t = A C_m$. 

The benefits of exploiting previously established geometric transformations were 
also noted by [13]. 
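
The two steps of a propagation attempt can be sketched as follows. The region representation is an assumption made for this example: each region is a pair `(M, c)` where the 2x2 matrix `M` and the centre `c` map the unit circle onto the region, so that the affine transformation $A$ has linear part $M_{S_t} M_{S_m}^{-1}$. The helper names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2))

def mat_inv(a):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return ((a[1][1] / det, -a[0][1] / det),
            (-a[1][0] / det, a[0][0] / det))

def mat_vec(a, v):
    """Product of a 2x2 matrix and a 2-vector."""
    return (a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1])

def propagate(support_m, support_t, candidate_m):
    """Project candidate_m into the test image with the affine map S_m -> S_t."""
    (Ms, cs), (Mt, ct), (Mc, cc) = support_m, support_t, candidate_m
    linear = mat_mul(Mt, mat_inv(Ms))            # step 1: linear part of A
    offset = (cc[0] - cs[0], cc[1] - cs[1])      # candidate centre relative to S_m
    moved = mat_vec(linear, offset)
    # Step 2: C_t = A C_m (shape and centre are both transformed).
    return mat_mul(linear, Mc), (ct[0] + moved[0], ct[1] + moved[1])
```

For instance, if the support is enlarged by a factor 2 and shifted to (10, 10), a candidate one unit to its right in the model lands two units to the right of the support in the test image, with its shape enlarged by the same factor.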

Early Expansion 

Propagation attempts are used as follows. Consider as supports $\{S^i = (S_m^i, S_t^i)\}$ 
the soft-matches configuration $\Gamma$, and as candidates $\Lambda$ the coverage regions $\Omega$. 
For each support region $S_m^i$ we partition $I_m$ into 6 circular sectors centered on 
the center of $S_m^i$ (figure 2, bottom-right). Each $S_m^i$ attempts to propagate the 
closest candidate region in each sector. As a consequence, each candidate $C_m$ 
has an associated subset $\Gamma_{C_m} \subset \Gamma$ of supports that will compete to propagate 
it. For a candidate $C_m$ and each support $S^i$ in $\Gamma_{C_m}$ do: 

1. Generate $C_t^i$ by attempting to propagate $C_m$ via $S^i$. 

2. Refine $C_t^i$. If $C_t^i$ correctly matches $C_m$, this adapts it to the local surface 
orientation (handles curved and deformable objects) and perspective effects 
(the affine approximation is only valid on a local scale). 

3. Evaluate the quality of the refined propagation attempt: $\mathrm{sim}_i = \mathrm{sim}(C_m, C_t^i)$. 

We retain $C_t^{best}$, with $best = \arg\max_i \mathrm{sim}_i$, the best refined propagation 
attempt. $C_m$ is considered successfully propagated to $C_t^{best}$ if $\mathrm{sim}_{best} > t_2$ (the 
matching threshold). This procedure is applied for all candidates $C_m \in \Lambda$. 

Most support matches may actually be mismatches, and many of them typ- 
ically lie around each of the few correct ones (e.g.: several matches in a single 
soft-match, figure 2, middle). In order to cope with this situation, each sup- 
port concentrates its efforts on the nearest candidate in each direction, as it 
has the highest chance to undergo a similar geometric transformation. Addition- 
ally, every propagation attempt is refined before evaluation. Refinement raises 
the similarity of correctly propagated matches much more than the similarity 
of mispropagated ones, thereby helping correct supports to win. This results 
in a limited, but controlled growth, maximizing the chance that each correct 
match propagates, and limiting the proliferation of mispropagations. The pro- 
cess also restricts the number of refinements to at most 6 per support (contains 
computational cost). 
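
The sector rule, by which each support selects at most one candidate per direction, might be sketched as below. This is an illustrative sketch: regions are reduced to their centre points, a candidate that coincides with the support centre is skipped, and the function name is hypothetical.

```python
import math

def nearest_candidate_per_sector(support_center, candidate_centers, n_sectors=6):
    """Map each sector index to the index of the closest candidate in that sector."""
    sx, sy = support_center
    width = 2 * math.pi / n_sectors
    best = {}   # sector index -> (distance, candidate index)
    for idx, (cx, cy) in enumerate(candidate_centers):
        dx, dy = cx - sx, cy - sy
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue   # no direction is defined for a coincident centre (assumption)
        angle = math.atan2(dy, dx) % (2 * math.pi)
        sector = min(int(angle / width), n_sectors - 1)
        if sector not in best or dist < best[sector][0]:
            best[sector] = (dist, idx)
    return {sector: idx for sector, (_, idx) in best.items()}
```

Because the result holds at most `n_sectors` entries, a support never triggers more than 6 refinements, which matches the cost bound stated above.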

For the case-study, 113 new matches are generated and added to the 
configuration $\Gamma$. 17 of them are correct and located around the initial 3. The 
correct-ratio of $\Gamma$ improves to 20/330 (figure 4, left), but it is still very low. 

4 Early Contraction 

The early expansion guarantees high chances that each initial correct match 
propagates. As an initial filter, we discard all matches that did not succeed in 
propagating any region. The correct-ratio improves to 20/175 (no correct match is 
lost), but it is still too low for applying a global filter. Hence, we have developed 
the following local filter. 

A local group of regions in the model image have uniform shape, are arranged 
on a grid and intersect each other with a specific pattern. If all these regions are 
correctly matched, the same regularities also appear in the test image, because 
the surface is contiguous and smooth (regions at depth discontinuities can’t be 
matched correctly anyway). This holds for curved or deformed objects as well, 
because the affine transformation varies slowly and smoothly across neighbor- 
ing regions (figure 3, left). On the other hand, mismatches tend to be located 
elsewhere in the image and to have different shapes. We propose a novel, local 
filter based on this observation. Let $\{N_m^i\}$ be the neighbors of a region $R_m$ in 
the model image. Two regions $A$, $B$ are considered neighbors if they intersect, 
i.e.: if $\mathrm{Area}(A \cap B) > 0$. Only neighbors which are actually matched to the test 
image are considered. Any match $(R_m, R_t)$ is removed from $\Gamma$ if 

$$\sum_{\{N_m^i\}} \left| \frac{\mathrm{Area}(R_m \cap N_m^i)}{\mathrm{Area}(R_m)} - \frac{\mathrm{Area}(R_t \cap N_t^i)}{\mathrm{Area}(R_t)} \right| > t_s$$

with $t_s$ some threshold. The filter tests the preservation of the pattern of 
intersections between $R$ and its neighbors (the ratio of areas is affine invariant). Hence, 
a removal decision is based solely on local information. As a consequence, this 
filter is unaffected by the current, low overall ratio of correct matches. Shape 
information is integrated in the filter, making it capable of spotting insidious 
mismatches which are roughly correctly located, yet have a wrong shape. This 
is an advantage over the (semi-) local filter proposed by [6], and later also used 
by others [14], which verifies if a minimal amount of regions in an area around 
$R_m$ in the model image also match near $R_t$ in the test image. 

Fig. 3. Left: the regular arrangement of the regions is preserved. Middle: top: a 
candidate (thin) and 2 of 20 supports (thick) within the large circular area, bottom: the 
candidate is propagated to the test image using the affine transformation of the 
support on the right. Refinement adapts the shape to the perspective (brighter). Right: 
sidedness constraint. $R^1$ is on the same side of the line in both images 

The input regions need not be arranged in a regular grid, the filter applies 
to a general set of (intersecting) regions. Note that incorrectly matched regions 
with no neighbors will not be detected. The algorithm can be implemented to 
run in $O(|\Gamma| + x)$, with $x \ll |\Gamma|^2$ the number of region intersections. 

Applying this filter to the case-study brings the correct-ratio of $\Gamma$ to 13/58, 
thereby greatly reducing the number of mismatches. 
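
The area-ratio test of the local filter can be sketched with a simplified region model. Here every region is a circle `(x, y, radius)` in both images, so that intersection areas have a closed form; real regions are affinely deformed, so this is only an illustration. The naive double loop is quadratic in the number of matches, unlike the more efficient implementation mentioned above, and all names are hypothetical.

```python
import math

def circle_area(c):
    return math.pi * c[2] ** 2

def circle_overlap(a, b):
    """Intersection area of two circles given as (x, y, radius)."""
    d = math.hypot(a[0] - b[0], a[1] - b[1])
    r1, r2 = a[2], b[2]
    if d >= r1 + r2:
        return 0.0                            # disjoint circles
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2     # one circle inside the other
    a1 = r1 * r1 * math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
    a2 = r2 * r2 * math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
    a3 = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2)
                         * (d - r1 + r2) * (d + r1 + r2))
    return a1 + a2 - a3

def local_filter(matches, t_s):
    """Keep the matches (R_m, R_t) whose summed area-ratio change is at most t_s."""
    kept = []
    for i, (rm, rt) in enumerate(matches):
        err = 0.0
        for j, (nm, nt) in enumerate(matches):
            overlap_m = circle_overlap(rm, nm)
            if j == i or overlap_m <= 0.0:
                continue                      # only matched model neighbours count
            err += abs(overlap_m / circle_area(rm)
                       - circle_overlap(rt, nt) / circle_area(rt))
        if err <= t_s:
            kept.append((rm, rt))
    return kept
```

Note that a mismatch also raises the error of its correctly matched neighbours, so the threshold `t_s` has to leave some slack; the value used in any experiment with this sketch would be a tuning choice.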



5 Main Expansion 



The first ’early’ expansion and contraction phases brought several additional cor- 
rect matches and removed many mismatches, especially those that concentrated 
around the correct ones. Since $\Gamma$ is cleaner, we can now try a faster expansion. 

All matches in the current configuration $\Gamma$ are removed from the candidate 
set, $\Lambda \leftarrow \Lambda \setminus \Gamma$, and are used as supports. All support regions $S^i$ in a circular 
area¹ around a candidate $C_m$ compete to propagate it: 

1. Generate $C_t^i$ by attempting to propagate $C_m$ via $S^i$. 

2. Evaluate $\mathrm{sim}_i = \mathrm{sim}(C_m, C_t^i)$. 

We retain $C_t^{best}$, with $best = \arg\max_i \mathrm{sim}_i$, and refine it, yielding $C_t^{ref}$. $C_m$ 
is considered successfully propagated to $C_t^{ref}$ if $\mathrm{sim}(C_m, C_t^{ref}) > t_2$ (figure 3, 
middle). This scheme is applied for each candidate. 

In contrast to the early expansion, many more supports compete for the same 
candidate, and no refinement is applied before choosing the winner. However, 
the presence of more correct supports, now tending to be grouped, and fewer 
mismatches, typically spread out, provides good chances that a correct support 
will win a competition. In this process each support has the chance to propagate 
many more candidates, spread over a larger area, because it offers help to all 
candidates within a wide circular radius. This allows the system to grow a mass 
of correct matches. Moreover, the process can jump over small occlusions or 
degraded areas, and costs only one refinement per candidate. 185 new matches, 
61 correct, are produced for the case-study, thus lifting the correct-ratio of $\Gamma$ to 
74/243 (30.5%, figure 4, middle). 

6 Main Contraction 

At this point the chances of having a sufficient number of correct matches to try 
a global filter are much better. In contrast to the local filter of section 4, the fol- 
lowing global filter is capable of finding also isolated mismatches. The algorithm 
extends our topological filter in [1] to include also appearance similarity. 

Figure 3 (right) illustrates the property on which the filter is based. The 
center of a region $R^1$ should be on the same side of the directed line going from 
the center of a second region $R^2$ to the center of a third region $R^3$ in both 
the model and test images (noted $\mathrm{side}(R^1, R^2, R^3)$). This sidedness constraint 
holds for all correctly matched triples of coplanar regions and also for most non- 
coplanar ones [1]. It does not hold for non-coplanar triples in presence of strong 
parallax in a few cases, coined parallax-violations [1]. 
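
The sidedness test reduces to the sign of a 2-D cross product. The following sketch works on region centres given as `(x, y)` tuples; the encoding of the side as +1, -1 or 0 is a choice made for this example.

```python
def side(p1, p2, p3):
    """+1 if p1 is left of the directed line p2 -> p3, -1 if right, 0 if on it."""
    cross = ((p3[0] - p2[0]) * (p1[1] - p2[1])
             - (p3[1] - p2[1]) * (p1[0] - p2[0]))
    return (cross > 0) - (cross < 0)

def sidedness_preserved(triple_model, triple_test):
    """True if the first centre lies on the same side in both images."""
    return side(*triple_model) == side(*triple_test)
```

A rotation or any other orientation-preserving deformation of the three centres leaves the side unchanged, whereas mirroring one centre across the line flips it.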

A triple including any mismatched region has higher chances to violate the 
constraint. When this happens, we can only conclude that probably at least 
one of the matches is incorrect, but we do not yet know which. However, by 
integrating the weak information each triple provides, it is possible to robustly 
discover mismatches. Hence, we check the constraint for all unordered triples 
and we expect wrong matches to be involved in a higher share of violations: 

$$\mathrm{err}_{topo}(R^i) = \frac{1}{v} \sum_{R^j, R^k \in \Gamma \setminus R^i,\; j > k} \left| \mathrm{side}(R_m^i, R_m^j, R_m^k) - \mathrm{side}(R_t^i, R_t^j, R_t^k) \right| \qquad (1)$$

with $v = (n-1)(n-2)/2$, $n = |\Gamma|$. $\mathrm{err}_{topo}(R^i) \in [0, 1]$ because it is normalized 
w.r.t. the maximum number of violations $v$ any region can be involved in. As 
a novel extension to [1], the topological error share (1) is combined with an 
appearance term, giving the total error 

$$\mathrm{err}_{tot}(R^i) = \mathrm{err}_{topo}(R^i) + (t_2 - \mathrm{sim}(R_m^i, R_t^i))$$

¹ In all experiments the radius is set to 1/6 of the image size. 

The filtering algorithm goes as follows: 

1. (Re-)compute $\mathrm{err}_{tot}(R^i)$ for all $R^i \in \Gamma$. 

2. Find the worst match $R^w$, with $w = \arg\max_i \mathrm{err}_{tot}(R^i)$. 

3. If $\mathrm{err}_{tot}(R^w) > 0$, remove $R^w$: $\Gamma \leftarrow (\Gamma \setminus R^w)$, and iterate to 1, else stop. 
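
The three steps above can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: each match is a tuple `(center_m, center_t, sim)`, a violation is counted as 1 whenever the side differs between the two images, and the brute-force recomputation is cubic per iteration, which is fine for illustration but not tuned for speed. All names are hypothetical.

```python
from itertools import combinations

def side(p1, p2, p3):
    """Sign of the side of p1 with respect to the directed line p2 -> p3."""
    cross = ((p3[0] - p2[0]) * (p1[1] - p2[1])
             - (p3[1] - p2[1]) * (p1[0] - p2[0]))
    return (cross > 0) - (cross < 0)

def global_filter(matches, t2):
    """Greedily remove the match with the largest total error while it is positive."""
    matches = list(matches)
    while len(matches) >= 3:
        n = len(matches)
        v = (n - 1) * (n - 2) / 2          # max number of triples per region
        errors = []
        for i, (m_i, t_i, sim_i) in enumerate(matches):
            others = [m for j, m in enumerate(matches) if j != i]
            violations = sum(
                side(m_i, m_j, m_k) != side(t_i, t_j, t_k)
                for (m_j, t_j, _), (m_k, t_k, _) in combinations(others, 2))
            # Topological error share plus appearance term (t2 - sim).
            errors.append(violations / v + (t2 - sim_i))
        worst = max(range(n), key=errors.__getitem__)
        if errors[worst] <= 0:
            break                          # nothing left to remove
        del matches[worst]
    return matches
```

In a toy configuration with four consistent matches and one displaced centre, the displaced match takes part in the largest share of violated triples and is removed first, after which all remaining errors drop to `t2 - sim` and the loop stops.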

The idea of the algorithm is that at each iteration the most probable 
mismatch $R^w$ is removed and the error of correct matches decreases, because 
they are involved in fewer triples containing a mismatch. After several 
iterations, ideally only correct matches are left and the algorithm stops. The second 
term of $\mathrm{err}_{tot}$ decreases with increasing appearance similarity, and it vanishes 
when $\mathrm{sim}(R_m^i, R_t^i) = t_2$, the match acceptance threshold. The removal 
criterion $\mathrm{err}_{tot} > 0$ expresses the idea that topological violations are accepted up to 
the degree to which they are compensated by high similarity. This helps in finding 
mismatches which can hardly be judged by only one cue. A typical mismatch 
with similarity just above $t_2$ will be removed unless it is perfectly topologically 
located. Conversely, correct matches with $\mathrm{err}_{topo} > 0$ due to parallax-violations 
are in little danger, because they typically have good similarity. Including 
appearance makes the filter more robust to low correct-ratios, and remedies the 
drawback (parallax-violations) of the purely topological filter [1]. 

The proposed method offers two main advantages over rigid-motion filters, 
traditionally used in the matching literature [2,5,4,13,7,14], e.g.: detecting out- 
liers to the epipolar geometry through RANSAC [3]. First, it allows for non-rigid 
deformations, like the bending of paper or cloth, because the structure of the 
spatial arrangements, captured by the sidedness constraints, is stable under these 
transformations. Second, it is much less sensitive to inaccurate localizations, 
because $\mathrm{err}_{topo}$ varies slowly and smoothly for a region departing from its ideal 
location. 

Topological configurations of points and lines are also used in [15], which 
enforces the cyclic ordering of line segments connecting corners as a means for 
steering the matching process. 

In the case-study, the filter starts from 74/243 and returns 54/74, which is 
a major improvement. 20 correct matches are lost, but many more mismatches 
(149) are removed. The further processing will recover the lost correct matches 
and generate even more. 

7 Exploring the Test Image 

The processing continues by iteratively alternating main expansion and main 
contraction phases: 
1. Do a main expansion phase. This produces a set of propagated region 
matches $\Upsilon$, which are added to the configuration: $\Gamma \leftarrow (\Gamma \cup \Upsilon)$. 

2. Do a main contraction phase on $\Gamma$. 

3. If at least one newly propagated region survives the contraction, i.e. 
$|\Upsilon \cap \Gamma| > 0$, then iterate to 1, after updating the candidate set to 
contain $\Lambda \leftarrow (\Omega \setminus \Gamma)$, all original candidate regions $\Omega$ which are not yet in the 
configuration. 
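
The control flow of this alternation can be sketched independently of how expansion and contraction are implemented. In the sketch below, `expand` and `contract` are caller-supplied functions, each match is a hashable pair `(model_region_id, test_region)`, and `max_iters` is a safety bound; all of these are assumptions of the example rather than details from the paper.

```python
def explore(initial, coverage, expand, contract, max_iters=100):
    """Alternate expansion and contraction until no new match survives.

    initial:  iterable of matches (model_region_id, test_region)
    coverage: iterable of model_region_id values (the coverage regions)
    expand(config, candidates) -> iterable of newly propagated matches
    contract(config)           -> iterable of matches kept by the filter
    """
    config = set(initial)
    for _ in range(max_iters):
        matched_ids = {model_id for model_id, _ in config}
        candidates = set(coverage) - matched_ids   # regions not yet in the configuration
        new = set(expand(config, candidates))      # step 1: main expansion
        config = set(contract(config | new))       # step 2: main contraction
        if not (new & config):                     # step 3: no new survivor, stop
            break
    return config
```

With an `expand` that only reaches direct neighbours of already matched regions and a `contract` that keeps everything, the loop grows the configuration one ring at a time and terminates once the coverage is exhausted.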

In the first iteration, the expansion phase generates some correct matches, along 
with some mismatches, thereby increasing the correct-ratio. The first main con- 
traction phase removes mostly mismatches, but might also lose several correct 
matches: the amount of noise could still be high and limit the filter’s perfor- 
mance. In the next iteration, this cleaner configuration is fed into the expansion 
phase again which, less distracted, generates more correct matches and fewer 
mismatches. The new correct matches in turn help the next contraction stage in 
taking better removal decisions, and so on. As a result, the amount, percentage 
and spatial extent of correct matches increase at every iteration, reinforcing the 
confidence about the object’s presence and location. The two goals of separating 
correct matches and gathering more information about the object are achieved 
at the same time. 

Correct matches erroneously killed by the contraction step in an iteration get 
another chance during the next expansion phase. With even fewer mismatches 
present, they are probably regenerated, and this time have higher chances to 
survive the contraction (higher correct-ratio, more positive evidence present). 

Thanks to the refinement, each expansion phase adapts the shape of the 
newly created regions to the local surface orientation. Thus the whole exploration 
process follows curved surfaces and deformations. 

The exploration procedure tends to ’implode’ when the object is not in the 
test image, typically returning 0, or at most a few matches. Conversely, when the 
object is present, the approach fills the visible portion with many high confidence 
matches. This yields high discriminative power and the qualitative shift from 
only detecting the object to knowing its extent in the image and which parts are 
occluded. Recognition and segmentation are intensely intertwined. 




Fig. 4. Case-study. Left: 20 correct matches (dark) out of 330 after early expansion. 
Middle: 74/243 after the first main expansion. Right: contour of the final set of matches. 
Note the segmentation quality, in particular the detection of the occluding rubber 

In the case-study, the second main expansion propagates 141 matches, 117 
correct, which is better than the previous 61/185. The second main contraction 
starts from 171/215 and returns 150/174, killing a lower percentage of correct 
matches than the first main contraction. After the 11th iteration 220 matches 
cover the whole visible part of the object (figure 4, right). 
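
The correct-ratios quoted for the case-study are mutually consistent, which can be checked with a few lines of arithmetic. The helper below is purely illustrative bookkeeping; only the counts themselves come from the text.

```python
def after_expansion(correct, total, new_correct, new_total):
    """Correct-ratio (correct, total) after adding newly propagated matches."""
    return correct + new_correct, total + new_total

# Counts taken from the case-study in the text.
early = after_expansion(3, 217, 17, 113)          # early expansion
main_first = after_expansion(13, 58, 61, 185)     # first main expansion
main_second = after_expansion(54, 74, 117, 141)   # second main expansion

# Percentage of correct matches after the first main expansion.
first_main_percent = round(100 * main_first[0] / main_first[1], 1)
```

The three results reproduce the ratios 20/330, 74/243 and 171/215 reported at the corresponding stages, and the percentage after the first main expansion rounds to 30.5.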

8 Results and Conclusion 

We report results for a set of 9 model objects and 23 test images. In total, the 
objects appear 43 times, as some test images contain several objects. There are 3 
planar objects, each modeled by a single view, including a Kellogs box (figure 2), 
and two magazines Michelle (figure b2) and Blonde (analog model view). Two 
objects with curved shapes, Xmas (g2) and Ovo (e1), have 6 model views. Leo 
(c1), Car (f1), Suchard (d2) feature more complex 3D shape and have 8 model 
views. Finally, one frontal view models the last 3D object, Guard (a1). Multiple 
model views are taken equally spaced around the object. The contributions from 
all model views are integrated by superimposing the area covered by the final 
set of matched regions (to find the contour), and by summing their number 
(recognition criterion). All images are shot at a modest resolution (720x576) and 
all experiments are conducted with the same set of parameters. In general, in 
the test cases there is considerable clutter and the objects appear smaller than 
in the models (all models are shown at the same scale as the test images). 

Tolerance to deformations is shown in a2, where Michelle is simultaneously 
strongly folded and occluded. The contours are found with a good accuracy, 
extending to the left until the edge of the object. Note the extensive clutter. 
High robustness to viewpoint changes is demonstrated in b1, where Leo is only 
half visible and captured in a considerably different pose than any of the model 
views, while Michelle undergoes a very large out-of-plane rotation of about 80 
degrees. Guard, occluding Michelle, is also detected in the image, despite a scale 
change of factor 3. In d1, Leo and Ovo exhibit significant viewpoint change, while 
Suchard is simultaneously scaled by a factor of 2.2 and 89% occluded. This very high 
occlusion level makes this case challenging even for a human observer. A scale 
change of factor 4 affecting Suchard is illustrated in e2. In figure f2, Xmas is 
divided in two by a large occluder. Both visible parts are correctly detected by 
the presented method. On the right side of the image, Car is found even though it 
is half occluded and very small. Car is also detected in spite of considerable viewpoint 
change in g1. The combined effects of strong occlusion, scale change and clutter 
make h2 an interesting case. Note how the boundaries of Xmas are accurately 
found, and in particular the detection of the part behind the glass. As a final 
example, 8 objects are detected at the same time in i2 (for clarity, only 3 contours 
are shown). Note the correct segmentation of the two deformed magazines and 
the simultaneous presence of all the aforementioned difficulty factors. 

Figure h1 presents a close-up on one of 93 matches produced between a 
model view of Xmas (left) and test case h2 (right). This exemplifies the great 
appearance variation resulting from combined viewpoint, scale and illumination 
changes, and other sources of image degradation (here a glass). In these cases, 
it is very unlikely for the region to be detected by the initial region extractor, 
and hence traditional methods fail. This figure also illustrates the accuracy of 
the correspondences generated by the expansion phases. 

As a proof of the method’s capability of following deformations, we tried 
to process the case in c2 starting with only one match (dark). 356 regions, 
covering the whole object, were produced. Each region’s shape fits the local 
surface orientation (for clarity, only 3 regions are shown). 

The discriminative power of the system was assessed by processing all pairs 
of model-object and test images, and counting the resulting amount of region 
matches. The highest ROC curve in figure i1 depicts the detection rate versus 
false-positive rate, while varying the detection threshold from 0 to 200 matches. 
The method performs very well, and can achieve 98% detection with 6% false- 
positives. For comparison, we processed the dataset also with 4 state-of-the-art 
affine region extractors [7,5,11,2], and described the regions with the SIFT [8] 
descriptor (footnote 2), which has recently been demonstrated to perform best [12]. The 
matching is carried out by the ’unambiguous nearest-neighbor’ approach (footnote 3) 
advocated in [11,8]: a model region is matched to the region of the test image with 
the closest descriptor if its distance is less than 0.7 times the distance to the 
second-closest descriptor (the threshold 0.7 has been empirically determined to optimize 
results). Each of the central curves in i1 illustrates the behavior of a different 
extractor. As can be seen, none is satisfactory, which demonstrates the higher 
level of challenge offered by the dataset and therefore suggests that our approach 
can broaden the range of solvable OR cases. Closer inspection reveals the source 
of failure: typically only very few, if any, correct matches are produced when 
the object is present, which in turn is due to the lack of repeatability and the 
inadequacy of a simple matcher under such difficult conditions. The important 
improvement brought by the proposed method is best quantified by the differ- 
ence between the highest curve and the central thick curve, representing the 
system we started from [2] (labeled ’[2] org’ in the plot). 
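
The ’unambiguous nearest-neighbor’ test described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: descriptors are assumed to be plain tuples of floats compared with the Euclidean distance, and the function name `ratio_test_matches` is hypothetical.

```python
import math

def ratio_test_matches(model_desc, test_desc, ratio=0.7):
    """Return (model_index, test_index) pairs that pass the nearest-neighbor ratio test."""
    matches = []
    for i, m in enumerate(model_desc):
        # Distances from this model descriptor to every test descriptor, ascending.
        dists = sorted((math.dist(m, t), j) for j, t in enumerate(test_desc))
        if len(dists) < 2:
            continue  # the test needs a second-closest candidate
        (d1, j1), (d2, _) = dists[0], dists[1]
        # Accept only if the closest descriptor is clearly closer than the runner-up.
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches
```

A model region whose two best candidates are nearly equidistant is left unmatched, which is what makes the test "unambiguous".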

The experiments confirm the power of the presented approach in solving 
very challenging cases. Moreover, non-rigid deformations are explicitly taken 
into account, and the approximate boundaries of the object are found, two 
features lacking in competing approaches [4,8,2,7,11,5,14]. The method is of general 
applicability, as it works with any affine invariant feature extractor. Future work 
aims at better exploiting the relationships between multiple model-views, at ex- 
tending the scope to less richly textured objects, and at improving computational 
efficiency (currently, a 1.4 GHz computer takes some minutes to process a pair 
of model and test images). 



2 All region extractors and the SIFT descriptor are implementations of the respective 
authors. We are grateful to Jiri Matas, Krystian Mikolajczyk, Andrew Zisserman, 
Cordelia Schmid and David Lowe for providing the programs. 

3 We have also tried the standard approach, used in [7,5,2,12], which simply matches 
two nearest-neighbors if their distance is below a threshold, but it produced slightly 
worse results. 

Abstract. We present a probabilistic framework for recognizing objects 
in images of cluttered scenes. Hundreds of objects may be considered 
and searched in parallel. Each object is learned from a single training 
image and modeled by the visual appearance of a set of features, and 
their position with respect to a common reference frame. The recognition 
process computes identity and position of objects in the scene by finding 
the best interpretation of the scene in terms of learned objects. Features 
detected in an input image are either paired with database features, or 
marked as clutter. Each hypothesis is scored using a generative model 
of the image which is defined using the learned objects and a model for 
clutter. While the space of possible hypotheses is enormously large, one 
may find the best hypothesis efficiently - we explore some heuristics to do 
so. Our algorithm compares favorably with state-of-the-art recognition 
systems. 



1 Introduction 

In the computer vision literature there is broad agreement that objects and 
object categories should be represented as collections of parts (or features) which 
appear in a given mutual position or shape (e.g., side-by-side eyes, a nose below 
them, etc.). Each feature contains local information describing the image content 
[2,3]. There is, however, disagreement as to the best tradeoff in this design space. 
On one hand, one may wish to represent the appearance and position of parts in 
a careful probabilistic framework, which allows one to derive principled learning 
and detection algorithms. One example of this approach is the ‘constellation 
model’ [4] which has been successfully applied to unsupervised learning and 
recognition of object categories amongst clutter [5,6]. This approach is penalized 
by a large number of parameters that are needed to represent appearance and 
shape and by algorithmic complexity - as a result there is a practical limit 
to the size of the models that one can use, typically limiting the number of 
object parts below 10. On the other hand, one finds in the literature models 
containing hundreds of features. In this case the authors dramatically simplify 
the way appearance and position are modeled as well as the algorithms used to 
learn and match models to images. A representative of this approach is David 
Lowe’s algorithm [7,8] which can recognize simultaneously and quickly multiple 
individual objects (as opposed to categories). 



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3021, pp. 55—68, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 

We are interested in exploring whether probabilistically rigorous modeling 
may be extended to yield practical data structures and algorithms for mod- 
els that contain hundreds of features. To this end, we modify the constellation 
model [6,9] to incorporate a number of attractive features presented by Lowe: 
using a KD-tree for efficiently associating features by appearance as well as com- 
puting feature positions with respect to a common reference frame rather than 
with respect to each other. Additionally, we pool representational parameters 
amongst features. As a result, it is possible to learn models quickly based on 
a single example; additionally, the system gains the robustness associated with 
using a large number of features while also offering an expressive probabilistic 
model for verifying object presence. One additional contribution is exploring 
efficient algorithms for associating models with images that are based on this 
probabilistic model and the A* search technique [10,11]. 

In section 2 we review the feature matching and constellation model ap- 
proaches upon which this paper builds. Section 3 details the probabilistic frame- 
work used in our recognition system. Section 4 describes the algorithm for incre- 
mentally constructing a high probability hypothesis without exploring the entire 
hypothesis space. In section 5, we discuss the task of learning. In section 6, we 
compare our system’s performance against that of a pure feature matching 
approach. Finally, in section 7, we present conclusions and discuss areas for further 
research. 



2 Related Research 

A feature-based recognition approach recently developed by Lowe [7,8] consists 
of four stages: feature detection, extraction of feature correspondences, pose pa- 
rameter estimation, and verification. Features are computed over multiple scales, 
at positions that are extrema of a difference-of-Gaussian function. An orienta- 
tion is assigned to a feature using the histogram of local image gradients. Each 
feature’s appearance is represented by a vector constructed from the local image 
region, sampled relative to the feature orientation. A k-d tree structure, mod- 
ified with backtracking for search efficiency [12], is the central component of 
a database used to perform efficient appearance-based feature matching. Each 
match between scene and model features suggests a position, orientation, and 
scale for the model within the scene. Recognition is achieved by grouping sim- 
ilar model poses using a Hough transform and then explicitly solving for the 
transformation from model to scene coordinates. 

The constellation model [4, 5, 6, 9] also relies on matching image parts, but typ- 
ically uses on the order of 5 features, whereas Lowe uses hundreds of features. 
Rather than restricting features to a rigid position, the constellation model uses 
a joint probability density on part positions. In addition, a probabilistic model 
for feature appearance is used, permitting the quality of matches to be measured. 
One drawback of the constellation model is the high number of training samples 
required, although recent work by Fei-Fei et al. [13] showed that learning can be 
efficiently achieved with few examples. Another disadvantage of the constellation 
model lies in the large computation time required in order to learn feature 
configurations, limiting it to the use of a relatively small number of parts for each object 
category. In our adaptation of the constellation approach for individual object 
recognition, this limitation disappears, at a slight cost to the model’s generality. 

3 Probabilistic Framework 

We model individual objects as constellations of features. Features are gener- 
ated by applying Lowe’s feature detector [7,8] to each training image. Each 
feature has a position, orientation, and scale within the object model as well as 
a feature vector describing its appearance. We learn probabilistic models for fore- 
ground and background feature appearance. The collection of models extracted 
from training images, together with a k-d tree of model features searchable by 
appearance, forms the database. 

Features are generated from a scene using the same procedure applied to 
training images. If a model is present, each of its features has a chance of ap- 
pearing as a scene feature. We also expect spurious background detections in 
the scene. A hypothesis assigns each scene feature to either the background or 
a model feature. It also specifies the pose of each model present in the scene. A 
hypothesis may indicate the presence of multiple instances of the same object, 
each in a different pose. 

The task of the recognition algorithm is to find the hypothesis that best 
explains the scene. The solution is the hypothesis with maximum probability 
conditioned on both the observed scene features and the database. 



3.1 Hypothesis Valuation 

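Let $\mathcal{O}$ denote the set of observed scene features, $\mathcal{D}$ the database, and $H$ a 
hypothesis. We define the valuation of $H$ by $v(H) = p(H \mid \mathcal{O}, \mathcal{D})$. Using Bayes rule, 

$$v(H) = p(H \mid \mathcal{O}, \mathcal{D}) = \frac{p(\mathcal{O} \mid H, \mathcal{D})\, p(H \mid \mathcal{D})}{p(\mathcal{O} \mid \mathcal{D})} \tag{1}$$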



The desired output of the recognition algorithm is the hypothesis H maximizing 
this valuation. In particular, 



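$$H = \arg\max_{H \in \mathcal{H}} \left( \frac{p(\mathcal{O} \mid H, \mathcal{D})\, p(H \mid \mathcal{D})}{p(\mathcal{O} \mid \mathcal{D})} \right) = \arg\max_{H \in \mathcal{H}} \big( p(\mathcal{O} \mid H, \mathcal{D})\, p(H \mid \mathcal{D}) \big) \tag{2}$$

where $\mathcal{H}$ denotes the set of all hypotheses and we dropped the constant $p(\mathcal{O} \mid \mathcal{D})$. 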

In order to evaluate these probabilities, we expand a hypothesis into several 
components. The hypothesis states which objects are in the scene and where 
those objects are detected in the scene. Let m denote the number of object 
detections predicted by hypothesis $H$. Then, for $i = 1 \ldots m$, $H$ specifies the 
model, $M_i \in \mathcal{D}$, of the $i$-th detected object, as well as a set of parameters, $Z_i$, 
describing that model’s pose in the scene. 

In addition to stating the position of detected objects, hypothesis $H$ 
attributes their appearance to features found in the scene. In particular, $H$ breaks 
the set of scene features, $\mathcal{O}$, into $m+1$ disjoint sets, $\mathcal{O}_0 \ldots \mathcal{O}_m$, where $\mathcal{O}_0$ is the 
set of features attributed to the background and for $i = 1 \ldots m$, $\mathcal{O}_i$ is the set 
of features attributed to model $M_i$. To specify the exact pairing between scene 
features in $\mathcal{O}_i$ and model features in $M_i$, we introduce two auxiliary variables, $\mathbf{d}_i$ 
and $\mathbf{h}_i$. The binary vector $\mathbf{d}_i$ indicates which features of $M_i$ are detected (value 
1) and which features are missing (value 0). Vector $\mathbf{h}_i$ also contains an entry for 
each feature $j$ of $M_i$. If $j$ is detected ($d_{ij} = 1$), then $h_{ij}$ indicates the element of 
$\mathcal{O}_i$ to which $j$ corresponds. In other words, $\mathbf{h}_i$ maps indices of detected model 
features to indices of their corresponding scene features. 

For notational convenience, we define the single vector $\mathbf{h}$ to contain the entire 
correspondence map between scene features and model features (or background). 
$\mathbf{h}$ is simply the concatenation of all the $\mathbf{h}_i$’s. Also, $n$ denotes the number of 
background features, or equivalently, the size of $\mathcal{O}_0$. Together, $\mathbf{h}$, $n$, $\{\mathbf{d}_1, \ldots, \mathbf{d}_m\}$, 
$\{Z_1, \ldots, Z_m\}$, and $\{M_1, \ldots, M_m\}$ completely specify a hypothesis. These 
variables contain all detection, pose, and feature correspondence information. 
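
To make this decomposition concrete, the following is a minimal sketch of one possible in-memory representation. The class and field names (`Detection`, `Hypothesis`, `pairing`) and the pose layout are assumptions made for illustration; they are not taken from the implementation described in this paper.

```python
from dataclasses import dataclass, field

@dataclass
class Detection:
    model_id: str                                  # which model M_i from the database
    pose: tuple                                    # Z_i; (tx, ty, rotation, scale) is an assumed layout
    pairing: dict = field(default_factory=dict)    # h_i: model feature index -> scene feature index
    n_model_features: int = 0

    def detected(self):
        """Binary vector d_i: 1 for paired model features, 0 for missing ones."""
        return [1 if j in self.pairing else 0 for j in range(self.n_model_features)]

@dataclass
class Hypothesis:
    n_scene_features: int                          # N, total number of scene features
    detections: list = field(default_factory=list)

    def background(self):
        """O_0: indices of scene features not claimed by any detection."""
        used = {s for det in self.detections for s in det.pairing.values()}
        return [s for s in range(self.n_scene_features) if s not in used]

    def is_consistent(self):
        """Each scene feature may be explained by at most one model feature."""
        claimed = [s for det in self.detections for s in det.pairing.values()]
        return len(claimed) == len(set(claimed))
```

In this representation the number of background features n is simply `len(hypothesis.background())`, so it does not need to be stored separately.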

Using this decomposition, we now return to the computation of the valuation 
of a hypothesis. From equation (2) we can redefine the hypothesis valuation as 

$$v'(H) = p(\mathcal{O} \mid H, \mathcal{D}) \cdot p(H \mid \mathcal{D}) \tag{3}$$

3.2 Pose and Appearance Density 

The term $p(\mathcal{O} \mid H, \mathcal{D})$ characterizes the probability density in location, scale, 
orientation, and appearance for the features detected in the scene image. 
Conditioning on the pose of models present in hypothesis $H$, we can assume that 
features attributed by $H$ to different model objects are mutually independent: 

$$p(\mathcal{O} \mid H, \mathcal{D}) = p(\mathcal{O} \mid \mathbf{h}, n, \{\mathbf{d}_i\}, \{Z_i\}, \{M_i\}, \mathcal{D}) = p(\mathcal{O}_0 \mid n, \mathcal{D}) \cdot \prod_{i=1}^{m} p(\mathcal{O}_i \mid \mathbf{h}_i, \mathbf{d}_i, Z_i, M_i, \mathcal{D}) \tag{4}$$



- $p(\mathcal{O}_0 \mid n, \mathcal{D})$ is the probability that the $n$ background detections would occur 
at the exact positions and with the exact appearances specified in $\mathcal{O}_0$. We 
assume each point in the (location, orientation, scale) space examined by 
the feature generator has an equal chance of producing a spurious detection. 
Assuming that clutter detections are independent from each other, 

$$p(\mathcal{O}_0 \mid n, \mathcal{D}) = \left[ \frac{1}{A} \cdot \frac{1}{2\pi} \right]^{n} \prod_{x \in \mathcal{O}_0} p_{bg}(x \mid \mathcal{D}) \tag{5}$$

where $A$ is the number of pixels in the Gaussian pyramid used for feature 
detection, or equivalently, the size of the (location, scale) space, and there 
is a range of $2\pi$ in possible values for orientation. $p_{bg}(x \mid \mathcal{D})$ is the density 
describing the appearance of background features. 

- $p(\mathcal{O}_i \mid \mathbf{h}_i, \mathbf{d}_i, Z_i, M_i, \mathcal{D})$ is the probability that the detections of the model 
features indicated by the hypothesis would occur with the exact pose and 
appearance specified in $\mathcal{O}_i$. Conditioning on model pose, we will assume 
independence between model features. This is a key assumption distinguishing 
our model from the constellation model [4,5,6]. Thus, 

$$p(\mathcal{O}_i \mid \mathbf{h}_i, \mathbf{d}_i, Z_i, M_i, \mathcal{D}) = \prod_{x \in \mathcal{O}_i} p_{pose}(x \mid \mathbf{h}_i, \mathbf{d}_i, Z_i, M_i, \mathcal{D}) \cdot p_{fg}(x \mid \mathbf{h}_i, \mathbf{d}_i, M_i, \mathcal{D}) \tag{6}$$

where $p_{pose}$ and $p_{fg}$ are the pose and appearance probabilities, respectively, 
for the foreground features. 

The discussion on learning in section 5 describes the technique used for estimating 
probability densities $p_{bg}$, $p_{pose}$, and $p_{fg}$. 
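
Because equations (4) to (6) are products of many small densities, an implementation would most likely work with logarithms. The sketch below is one such arrangement under that assumption. It reads equation (5) as n uniform pose terms 1/(A · 2π) multiplied by the background appearance densities. The function names are hypothetical, and the per-feature log-densities are assumed to be computed elsewhere.

```python
import math

def log_background_density(area, log_p_bg):
    """Log of Eq. (5): n uniform pose terms 1/(A * 2*pi) times the appearance densities."""
    n = len(log_p_bg)
    return -n * math.log(area * 2.0 * math.pi) + sum(log_p_bg)

def log_model_density(log_p_pose, log_p_fg):
    """Log of Eq. (6): product over the scene features paired with one model."""
    return sum(log_p_pose) + sum(log_p_fg)

def log_scene_density(area, log_p_bg, per_model):
    """Log of Eq. (4): background term plus one term per detected model.

    per_model is a list of (log_p_pose, log_p_fg) pairs, one pair per detection.
    """
    total = log_background_density(area, log_p_bg)
    for log_p_pose, log_p_fg in per_model:
        total += log_model_density(log_p_pose, log_p_fg)
    return total
```

Summing logs avoids the numerical underflow that multiplying hundreds of densities below one would cause.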



3.3 Hypothesis Prior 

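The term $p(H \mid \mathcal{D})$ is the prior on the hypothesis. We expand this term as 

$$\begin{aligned} p(H \mid \mathcal{D}) &= p(\mathbf{h}, n, \{\mathbf{d}_i\}, \{Z_i\}, \{M_i\} \mid \mathcal{D}) \\ &= p(\mathbf{h} \mid n, \{\mathbf{d}_i\}, \{Z_i\}, \{M_i\}, \mathcal{D}) \cdot p(n \mid \{\mathbf{d}_i\}, \{Z_i\}, \{M_i\}, \mathcal{D}) \cdot \left[ \prod_{i=1}^{m} p(\mathbf{d}_i \mid Z_i, M_i, \mathcal{D}) \right] \cdot p(\{Z_i\}, \{M_i\} \mid \mathcal{D}) \end{aligned} \tag{7}$$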



- $p(\mathbf{h} \mid n, \{\mathbf{d}_i\}, \{Z_i\}, \{M_i\}, \mathcal{D})$ is the probability of a specific set of feature 
assignments. As $\mathbf{h}$ is simply a vector of indices mapping model features to 
scene features, and we have no information on scene feature appearance or 
position at this stage, all mappings that predict $n$ background features and 
are consistent with the detection vectors $\{\mathbf{d}_i\}$ are equally likely. Hence, 

$$p(\mathbf{h} \mid n, \{\mathbf{d}_i\}, \{Z_i\}, \{M_i\}, \mathcal{D}) = p(\mathbf{h} \mid n, \{\mathbf{d}_i\}) = \begin{cases} \left[ \dfrac{N!}{(N - N_{fg})!} \right]^{-1} & \text{if } \mathbf{h}, n, \{\mathbf{d}_i\} \text{ are consistent} \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $N$ is the total number of features in the scene image and $N_{fg} = N - n$ 
is the number of foreground features predicted by the hypothesis. 

- $p(n \mid \{\mathbf{d}_i\}, \{Z_i\}, \{M_i\}, \mathcal{D})$ is the probability of obtaining $n$ background 
features. Background features are spurious responses of the feature detector 
that do not match with any known object. We assume a Poisson 
distribution for the number of background features [9]. Since scene images may 
have different sizes, the expected number of background detections is 
proportional to the area $A$ examined by the feature detector. If $\lambda$ denotes the 
mean number of background features per unit area, then 

$$p(n \mid \{\mathbf{d}_i\}, \{Z_i\}, \{M_i\}, \mathcal{D}) = p_{Poisson}(n \mid \lambda, A) = e^{-\lambda A} \frac{(\lambda A)^{n}}{n!} \tag{9}$$

- $p(\mathbf{d}_i \mid Z_i, M_i, \mathcal{D})$ is the probability of detecting the indicated model features 
from model $M_i$. Let $p_{ij}$ denote the probability that feature $j$ of model $M_i$ 
is detected in the scene. The probability that it is missing is $(1 - p_{ij})$. We 
break $p(\mathbf{d}_i \mid Z_i, M_i, \mathcal{D})$ into a term for detected features and a term for missing 
features to obtain 

$$p(\mathbf{d}_i \mid Z_i, M_i, \mathcal{D}) = \prod_{j \text{ detected}} p_{ij} \cdot \prod_{j \text{ missing}} (1 - p_{ij}) \tag{10}$$

- $p(\{Z_i\}, \{M_i\} \mid \mathcal{D})$ is the prior on detecting objects $\{M_i\}$ in poses $\{Z_i\}$. We 
model this prior by a uniform density over frame transformations and 
combinations of model objects in the scene. Thus, this term is dropped in the 
implementation presented here. 
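
The factors of the prior can be accumulated in the log domain, as in the sketch below. It rests on two assumptions: the uniform mass in equation (8) is taken to be the reciprocal of the number of injective mappings, N!/(N − N_fg)!, and a single detection probability is shared by all features, as section 5 later states. The function names are hypothetical.

```python
import math

def log_assignment_prior(n_scene, n_foreground):
    """Log of Eq. (8) for a consistent mapping: log((N - N_fg)! / N!)."""
    return math.lgamma(n_scene - n_foreground + 1) - math.lgamma(n_scene + 1)

def log_poisson(n, lam, area):
    """Log of Eq. (9): Poisson probability of n background features, mean lam * area."""
    mean = lam * area
    return -mean + n * math.log(mean) - math.lgamma(n + 1)

def log_detection_prior(detected_flags, p_detect):
    """Log of Eq. (10) with one shared detection probability p_detect."""
    return sum(math.log(p_detect if d else 1.0 - p_detect) for d in detected_flags)

def log_prior(n_scene, n_background, lam, area, detection_vectors, p_detect):
    """Log of Eq. (7); the uniform pose and model prior is dropped, as in the text."""
    n_foreground = n_scene - n_background
    total = log_assignment_prior(n_scene, n_foreground)
    total += log_poisson(n_background, lam, area)
    for flags in detection_vectors:
        total += log_detection_prior(flags, p_detect)
    return total
```

`math.lgamma(k + 1)` evaluates log(k!) without overflow, which matters when a scene contains thousands of features.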



4 Hypothesis Search 

The recognition process consists of finding the hypothesis $H$ that maximizes 
$v'(H)$. Unfortunately, due to the size of the hypothesis space $\mathcal{H}$, it is not possible 
to evaluate $v'(H)$ for each $H \in \mathcal{H}$. Early work by Grimson (e.g. [14]) showed the 
exponential growth of the search tree and the need for hypothesis pruning. Here, 
we use the A* search technique [10,11] to incrementally construct a reasonable 
hypothesis while only examining a small fraction of the hypothesis space. 

In incrementally constructing a solution, we introduce the notion of a partial 
hypothesis to refer to a partial specification of a hypothesis. In particular, a 
partial hypothesis specifies a set of models $\{M_i\}$ and their corresponding poses $\{Z_i\}$ 
as well as a pairing between scene features and model features. Unpaired scene 
features are either marked as background or unassigned, whereas unpaired model 
features are either missing or unassigned. The partial hypothesis does not dictate how the unassigned 
scene or model features are to be treated. A completion of a partial hypothesis 
is a hypothesis that makes the same assignments as the partial hypothesis, but 
in which there are no unassigned scene features. A completion may introduce 
new models, make pairings between unassigned scene and model features, mark 
unassigned scene features as background, or mark unassigned model features as 
missing. 

4.1 A* 

We can organize the set of all partial hypotheses into a tree. The root of the tree 
is the partial hypothesis containing no models and in which all scene features 
are unassigned. The leaves of the tree are all complete hypotheses (i.e., $\mathcal{H}$). 
Descending a branch of the tree corresponds to incrementally making decisions 
about feature assignments in order to further specify a partial hypothesis. 

We prioritize the exploration of the tree by computing a valuation for each 
partial hypothesis. Partial hypotheses are entered into a priority queue according 
to this valuation. At each step of the search procedure, the highest valuation 



Recognition by Probabilistic Hypothesis Construction 






partial hypothesis is dequeued and split into two new partial hypotheses. In one 
of these new hypotheses, a certain feature assignment is made. In the other new 
hypothesis, that feature assignment is expressly forbidden from occurring. This 
binary splitting ensures that a search of the hypothesis tree visits each partial 
hypothesis at most once. 

4.2 Partial Hypothesis Valuation 

To produce an effective search strategy, the valuation of a partial hypothesis 
should reflect the valuation of its best possible completion. If these two quan- 
tities were equal, the search would immediately descend the tree to the best 
complete hypothesis. However, it is impossible to compute the valuation of the 
best possible completion before actually finding this completion, which is the 
task of the search in the first place. Therefore, we will define the valuation of a 
partial hypothesis using a heuristic. 

The heuristic we use can be thought of as the “optimistic worst-case sce- 
nario”. It is the valuation of the partial hypothesis’s completion in which all 
unassigned scene features are marked as background and all unassigned model 
features are dropped from the model. Unassigned model features are counted as 
neither detected nor missing. They do not enter into probability computations. 

Note that this choice of heuristic is coherent with the expression for the valu- 
ation of a complete hypothesis. As the algorithm makes assignments in a partial 
hypothesis, its valuation approaches the valuations of its possible completions. 
Furthermore, this valuation is likely to serve as a decent guide for the search pro- 
cedure. It is a measure of the minimum performance offered by a branch under 
the assumption that further assignments along that branch will do no harm. 
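
The queue-driven binary splitting and the valuation heuristic can be illustrated on a toy problem. In the sketch below, each candidate match is assumed to change the log-valuation by a fixed amount (`gains[k]`), which is a strong simplification: in the actual system the valuation depends on pose and appearance. Undecided candidates contribute nothing, mirroring the convention that unassigned features are treated as background. The function name is hypothetical.

```python
import heapq
import itertools

def best_first_search(gains):
    """Best-first search over accept/forbid decisions on a list of candidate matches.

    gains[k] is the change in log-valuation if candidate k is accepted (toy assumption).
    Returns (valuation, accepted candidate indices) of the first complete hypothesis dequeued.
    """
    tie = itertools.count()  # unique tie-breaker so heap entries never compare further
    # Entries: (-valuation, tie, index of next undecided candidate, accepted indices).
    queue = [(0.0, next(tie), 0, ())]
    while queue:
        neg_val, _, k, accepted = heapq.heappop(queue)
        if k == len(gains):
            return -neg_val, list(accepted)  # no undecided candidates remain
        # Branch 1: make the assignment. Branch 2: expressly forbid it.
        heapq.heappush(queue, (neg_val - gains[k], next(tie), k + 1, accepted + (k,)))
        heapq.heappush(queue, (neg_val, next(tie), k + 1, accepted))
    return 0.0, []
```

Since every split produces one branch that makes a decision and one that forbids it, no partial hypothesis is ever enqueued twice.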

4.3 Initialization 

A list of potential database feature matches is created for each scene feature 
based on appearance. The empty partial hypothesis is split into two based on 
the best appearance match. One subbranch accepts this match, the other rejects 
it and forbids it. 

4.4 Search Step 

The partial hypothesis H with the highest valuation is dequeued. If H contains 
a model in which there are unassigned features, the algorithm picks one of these 
unassigned model features. A similar splitting to that in the initialization step 
is performed: one subbranch adds the match to the hypothesis, and the other 
forbids it as far as this hypothesis is concerned. In order to save computation 
time, we greedily follow only the branch that results in a better valuation. This 
is reasonable for rigid models in which the pose constraints should allow very 
few possibilities for a correct match in the scene. 

If there are no unassigned model features in H , we pick the unassigned scene 
feature with the best appearance-based match and split the hypothesis on this 
assignment: one subbranch accepts the match and adds the corresponding model 
to the hypothesis, the other rejects the match, and adds no information regarding 
this model. 

In both of the above cases, the resulting partial hypothesis or hypotheses are 
enqueued and the process is repeated. 

4.5 Termination 

The search process corresponding to one object terminates when no more unas- 
signed features are available in this object. The scene features paired with this 
object are removed, and the search iterates with the remaining scene features. 
If all model objects have been considered without fully explaining the scene, the 
unassigned scene features are considered as background detections. 



[Figure 1 graphic: panels a) and b), each showing the scene image alongside the database of models.]

Fig. 1. Sketch of hypothesis build. a) Initialization: The best appearance match in the 
database is identified for each scene feature. Each such match is entered in the queue 
as a partial hypothesis. b) Search for a new match in the partial hypothesis which has 
the highest valuation: we look for an unassigned feature in the same model image $M_i$. This 
feature is mapped to its best appearance match in the scene if this new pairing is 
coherent with the pose predicted by the hypothesis; otherwise, the match is rejected. 
The pose is then updated based on the new match. 



5 Learning 

Several components of the probabilistic framework given above must be inferred 
from training examples. Since our system requires only a single training image 
per object, we cannot estimate separate appearance and pose densities for each 
feature in an object model. We therefore utilize the entire feature database in 
estimating global probability densities which can be applied to all features. Note 
that only training images are used here, not the test set. 

5.1 Background Features 



To estimate the background appearance density, we assume that a typical back- 
ground feature looks like a feature found in the database. The background, like 
the database, is composed of objects. It just happens that these objects are 
not in the database. A probability density for the appearance of features in 
the database describes the appearance of background detections in a scene. We 
model this density with a full covariance Gaussian density. Letting $\mu_{bg}$ and $\Lambda_{bg}$ 
denote the mean and covariance of the database feature appearance vectors, 

$$p_{bg}(x \mid \mathcal{D}) = \frac{1}{(2\pi)^{d/2} |\Lambda_{bg}|^{1/2}} \exp\left( -\frac{1}{2} (x_{app} - \mu_{bg})^{T} \Lambda_{bg}^{-1} (x_{app} - \mu_{bg}) \right) \tag{11}$$

where $x_{app}$ is the appearance vector of feature $x$ and $d$ the dimension of 
appearance vectors. 

A typical model generates 500 to 1000 features, resulting for a database with 
100 objects in a total of 50,000 to 100,000 training examples for the background 
appearance density. As our experiments used 128-dimensional appearance vec- 
tors, this was a sufficient number of examples for estimating the gaussian density. 
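
Estimating and evaluating the background density of equation (11) can be sketched in a few lines. The version below uses only the standard library and a Cholesky factorization; the function names are hypothetical, and the choice of the unbiased (n − 1) covariance estimate is an assumption, since the text does not state which normalization was used.

```python
import math

def mean_and_covariance(vectors):
    """Sample mean and unbiased full covariance of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    mu = [sum(v[k] for v in vectors) / n for k in range(d)]
    cov = [[sum((v[a] - mu[a]) * (v[b] - mu[b]) for v in vectors) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mu, cov

def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a symmetric positive definite matrix."""
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_gaussian(x, mu, cov):
    """Log of the full-covariance Gaussian density of Eq. (11)."""
    d = len(x)
    L = cholesky(cov)
    diff = [x[k] - mu[k] for k in range(d)]
    # Forward substitution solves L z = diff; z . z is the squared Mahalanobis distance.
    z = []
    for i in range(d):
        z.append((diff[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(d))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))
```

The covariance must be positive definite for the factorization to succeed, which requires clearly more training vectors than dimensions.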

The mean number of background detections per unit area, $\lambda$, is specified by the 
programmer in our current implementation. When running Lowe’s detection 
method on our training and test sets, 80% of the detections were assigned to 
the background, therefore we chose this same fraction for $\lambda$. This parameter has 
only a weak effect on the total probability as the terms for pose and appearance 
dominate. 



5.2 Foreground Features 



The foreground appearance density must describe how closely a scene feature 
resembles the model feature to which it is matched. This density is difficult to 
estimate as in principle, it involves establishing hundreds of thousands of ground 
truth matches by hand. A possible shortcut is looking at statistics coming from 
planar scenes seen from different viewpoints [15], or synthetic deformations of 
an image [3]. 

Here we followed a different approach: we approximate a good match for a 
feature by its closest match in appearance in the database. The difference in 
appearance between correctly matched foreground features is modeled with a 
Gaussian density with full covariance matrix, and the covariance matrix $\Lambda_{fg}$ is 
estimated from the difference in appearance between database features paired 
in such a manner. This yields 

$$p_{fg}(x \mid \mathbf{h}_i, \mathbf{d}_i, M_i, \mathcal{D}) = \frac{1}{(2\pi)^{d/2} |\Lambda_{fg}|^{1/2}} \exp\left( -\frac{1}{2} (x_{app} - y_{app})^{T} \Lambda_{fg}^{-1} (x_{app} - y_{app}) \right) \tag{12}$$

where $y = \mathbf{h}_i^{-1}(x)$ is the model feature paired with scene feature $x$. 
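
The estimation of the foreground covariance from nearest-neighbor pairs can be sketched as follows. This is a brute-force illustration with hypothetical function names. It assumes that each database feature is paired with its closest other feature, and that the covariance is taken about zero because equation (12) has no mean term for the difference.

```python
import math

def nearest_neighbor_differences(features):
    """Pair each feature with its closest other feature and return the difference vectors."""
    diffs = []
    for i, f in enumerate(features):
        others = [k for k in range(len(features)) if k != i]
        j = min(others, key=lambda k: math.dist(f, features[k]))
        diffs.append([a - b for a, b in zip(f, features[j])])
    return diffs

def zero_mean_covariance(diffs):
    """Covariance of the difference vectors about zero."""
    n, d = len(diffs), len(diffs[0])
    return [[sum(v[a] * v[b] for v in diffs) / n for b in range(d)] for a in range(d)]
```

For a database of the size discussed in section 5.1, the quadratic nearest-neighbor loop would be replaced by the k-d tree already used for matching.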

Unlike background feature poses, which are modeled with a uniform distribution 
in equation (5), foreground features are expected to lie in a pose consistent 
with that of their corresponding model. In particular, model pose $Z_i$ predicts a 
scene location, orientation, and scale for each feature of $M_i$. If the hypothesis 
matches scene feature $x$ to model feature $y$, and $Z_i$ maps $y$ to $z$, we write 

$$p_{pose}(x \mid \mathbf{h}_i, \mathbf{d}_i, Z_i, M_i, \mathcal{D}) = G_{loc}(x \mid z) \cdot G_{\theta}(x \mid z) \cdot G_{s}(x \mid z) \tag{13}$$

where $G_{loc}$, $G_{\theta}$, and $G_{s}$ are Gaussian densities for location, orientation, and log 
scale, respectively, with means given by the pose of $z$. The covariance parameters 
of these densities are currently specified by hand, with values of 20 pixels for 
location, half an octave for log-scale and 60 degrees for orientation (orientation 
was quite unreliable). 
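
A possible evaluation of the pose density of equation (13) is sketched below. Several choices here are assumptions rather than details given in the text: the hand-set values are treated as standard deviations, the location density is taken to be isotropic with 20 pixels per axis, scale error is measured in octaves (base-2 logarithm), and orientation differences are wrapped to [−π, π) before applying a plain Gaussian. The function names are hypothetical.

```python
import math

def log_normal(err, sigma):
    """Log of a zero-mean 1-D Gaussian density with standard deviation sigma."""
    return -0.5 * (err / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def wrap_angle(a):
    """Map an angle difference in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi

def log_pose_density(x, z, sigma_loc=20.0, sigma_theta=math.radians(60.0), sigma_octaves=0.5):
    """Log of Eq. (13). x and z are (u, v, theta, scale): observed and predicted feature pose."""
    log_loc = log_normal(x[0] - z[0], sigma_loc) + log_normal(x[1] - z[1], sigma_loc)
    log_theta = log_normal(wrap_angle(x[2] - z[2]), sigma_theta)
    log_scale = log_normal(math.log2(x[3] / z[3]), sigma_octaves)
    return log_loc + log_theta + log_scale
```

With these values, a feature displaced by 20 pixels loses 0.5 in log-density, while a feature that is off by a full octave in scale loses 2.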

We determine the model pose Zi by solving for the similarity transform 
that minimizes, in the least-squares sense, the distance between observed and 
predicted locations of foreground model features. Zi is updated whenever a pre- 
viously unassigned feature of Mi is matched. 
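
For 2-D point locations, the least-squares similarity transform has a closed form that is convenient to write with complex numbers. The sketch below is one standard way of computing it and is not necessarily the solver used in our system; the function name `fit_similarity` is hypothetical. Each point (u, v) is treated as u + iv, and the fit minimizes the sum of |a·p + b − q|² over complex a (rotation and scale) and b (translation).

```python
import cmath

def fit_similarity(model_pts, scene_pts):
    """Least-squares similarity transform mapping model points onto scene points.

    Returns (scale, rotation in radians, (tx, ty)). Needs at least two distinct model points.
    """
    p = [complex(u, v) for u, v in model_pts]
    q = [complex(u, v) for u, v in scene_pts]
    n = len(p)
    p_mean, q_mean = sum(p) / n, sum(q) / n
    # Complex linear regression of the centered scene points on the centered model points.
    num = sum((qk - q_mean) * (pk - p_mean).conjugate() for pk, qk in zip(p, q))
    den = sum(abs(pk - p_mean) ** 2 for pk in p)
    a = num / den
    b = q_mean - a * p_mean
    return abs(a), cmath.phase(a), (b.real, b.imag)
```

Because the solution depends only on a few running sums, the pose can be refreshed cheaply each time a new match is added to the hypothesis.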

The probability $p_{ij}$ of detecting individual features is set to the same value 
across features and models. A reasonable choice is the fraction of features that 
are typically needed to produce a reliable pose estimate. This value was obtained 
by running Lowe’s detection method on our training and test sets: on average 
20% of a model’s features were found in a test image containing this model. This 
value of 20% was used for $p_{ij}$. 



[Figure 2 graphic: panels a), b) and c). Legend: matched keypoints; predicted part positions; predicted frame position; set of all scene features.]


Fig. 2. Example of result for a textured object included in a complex scene (only 
one detection shown here). According to this hypothesis, the box displayed in the 
model image is transformed into the box shown in the scene image. a) Initial object. 
b) Result of Lowe’s algorithm. Since the stuffed bear is a textured object, detection of 
similar features can occur in many locations, leading to incorrect pairings. As a result, 
the frame transformation, estimated only from the feature positions, is inaccurate. c) 
Result of the probabilistic search. 

6 Experimental Results 

In the absence of “standard” training and test sets of images containing both 
objects and clutter, we compare the performance of our probabilistic search 
method to that of Lowe’s algorithm on a training set consisting of 100 images 
of toys and common kitchen items, with a single image per object. The test set 
contained images of single objects, as well as complicated scenes that included
between 1 and 9 objects. It comprised 80 test images, with a total of 254 objects
to be detected (each object was considered as one detection). Some test images
did not contain any learned object; in that case, all feature detections are
expected to be assigned to the background. We used a
resolution of 480 x 320 for training images and test images of single objects, 
and 800 x 533 for complex scenes. All images were taken in a kitchen, with 
an off-the-shelf digital camera, and no precautions were taken with respect to 
lighting conditions, viewpoint angle and background. In particular, the lighting 
conditions varied significantly between training and test images, and viewing 
angles varied between 0 and 180 degrees (picture of the back of an object taken 
as test while the corresponding model was a picture of the front of the object). 
No image was manually segmented, and the proportion of features generated
by an object in a model or test image ranged from 10% to 80% (80% for a single
object occupying most of the image). The database is available online from
http://www.vision.caltech.edu/html-files/archive.html.

Our algorithm achieved a detection performance similar to Lowe’s system, 
with a detection rate of 85%. Figure 3 shows ROC curves for both methods. 
The threshold used is the accuracy of the best hypothesis at the end of the 
search. Since our method verifies the coherence of each match by scoring partial
hypotheses, our false alarm rate was lower than that of Lowe’s method.

In order to measure the accuracy of the pose transformations estimated by 
each method, the training and test images were manually marked with ground 
truth information. An ellipse was fitted, and a canonical orientation was chosen, 
for each object. We measured the accuracy of the transformation as the distance
in pixels between the predicted positions of the ellipses in a scene and the
previously recorded ground truth. The error was averaged across points regularly
spaced on the ellipse and across test images. We obtained a mean error of
45 pixels for our method, and 56 pixels for Lowe’s algorithm.
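
The error measure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors’ code: the ellipse parameterization (centre, semi-axes, rotation angle), the number of sample points and the function names are all hypothetical. Points are paired by their index along the ellipse, which presumes that both ellipses share the canonical orientation mentioned in the text.

```python
import math

def ellipse_points(cx, cy, a, b, theta, n=36):
    """Return n points regularly spaced (in parameter angle) on an ellipse.

    (cx, cy) is the centre, a and b are the semi-axes and theta is the
    rotation of the ellipse in radians. All of these are assumed inputs.
    """
    points = []
    for k in range(n):
        t = 2.0 * math.pi * k / n
        x, y = a * math.cos(t), b * math.sin(t)
        # Rotate by theta, then translate to the centre.
        points.append((cx + x * math.cos(theta) - y * math.sin(theta),
                       cy + x * math.sin(theta) + y * math.cos(theta)))
    return points

def mean_ellipse_error(predicted, truth, n=36):
    """Mean pixel distance between corresponding points of two ellipses."""
    p = ellipse_points(*predicted, n=n)
    q = ellipse_points(*truth, n=n)
    return sum(math.dist(u, v) for u, v in zip(p, q)) / n
```

For a pure translation of the predicted ellipse, every sampled point moves by the same amount, so the measure reduces to the length of the translation vector. Averaging this value over all test images would give a single figure comparable to the mean errors quoted above.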

Our approach requires examining and evaluating a much higher number of partial
and complete hypotheses than Lowe’s method. As a result,
the probabilistic algorithm is the slower of the two methods. Our unoptimized
code for Lowe’s method takes on average 2 seconds on a Pentium 4 running at
2.4GHz to identify objects in an 800 x 533 image, while our probabilistic algorithm
requires on average 10 seconds for the same image.

In practice, the A* search achieves only little pruning: typically 10-20% of the
branches are eliminated. Therefore, the valuation heuristic was coupled with a
stopping criterion (depth-first completion of the partial hypothesis that performs
best after 4000 iterations). The main computational benefit of the A* method
in this paper is to introduce a framework for evaluating partial hypotheses in a
way that is coherent with the valuation of complete hypotheses, and a ranking
of hypotheses that leads to efficient search.

[Figure 3 legend: our probabilistic algorithm; Lowe’s method]

Fig. 3. ROC based on the accuracy of the pose estimated by the best hypothesis. It
measures how much the hypothesis’ prediction of the object position differs from the
ground truth. This quantity can be measured for both recognition systems.

Fig. 4. Another example of recognition in a complex environment. a) and b) present
one match obtained by our probabilistic search. c) and d) are the best result from
Lowe’s voting approach. Since Lowe’s method does not evaluate the geometric and
appearance quality of hypotheses, numerous incorrect correspondences are accepted. As
a result, the estimated frame position is inaccurate. The probabilistic search accepts
only matches that are geometrically coherent, and leads to accurate pose parameters.

Fig. 5. Samples from our training and test sets. The red boxes show locations where
models were identified.

7 Discussion and Conclusion 

We have presented a new probabilistic model and efficient search strategy for 
recognizing multiple objects in images. Our model provides a unified view of two 
previous lines of research: it may be thought of as a probabilistic interpretation of
David Lowe’s work [7,8] or, conversely, as a special case of the constellation 
model [4] where many of the parameters are pooled amongst models, rather 
than learned individually. 

Our experiments indicate that the system we propose achieves the same 
detection rate as Lowe’s algorithm with significantly lower false alarm rates. 
The localization error of detected objects is also smaller. The price to be paid 
is a slower processing time, although this may not be a significant issue since 
our code is currently not optimized for speed. The front-end of both systems 
was identical (feature detection, feature representation, feature matching) and 
therefore all measurable differences are to be ascribed to the probabilistic model 
and to the matching algorithm. 

It is clear that the heuristic we chose for ranking partial hypotheses can be
improved. In choosing it we followed intuition rather than a principled
approach. This is obviously an area for further investigation. Developing better
techniques for estimating the probability density function of appearance and pose
error of both foreground and background features is another issue deserving of
further attention.
Abstract. We describe a novel method for human detection in single
images which can detect full bodies as well as close-up views in the pres- 
ence of clutter and occlusion. Humans are modeled as flexible assemblies 
of parts, and robust part detection is the key to the approach. The parts 
are represented by co-occurrences of local features which capture the
spatial layout of the part’s appearance. Feature selection and the part 
detectors are learnt from training images using AdaBoost. 

The detection algorithm is very efficient as (i) all part detectors use 
the same initial features, (ii) a coarse-to-fine cascade approach is used 
for part detection, (iii) a part assembly strategy reduces the number of 
spurious detections and the search space. The results outperform existing 
human detectors. 



1 Introduction 

Human detection is important for a wide range of applications, such as video 
surveillance and content-based image and video processing. It is a challenging 
task due to the various appearances that a human body can have. In a gen- 
eral context, as for example in feature films, people occur in a great variety of 
activities, scales, viewpoints and illuminations. We cannot rely on simplifying 
assumptions such as non-occlusion or similar pose. Of course, for certain appli- 
cations, such as pedestrian detection, some simplifying assumptions lead to much 
better results, and in this case reliable detection algorithms exist. For example, 
SVM classifiers have been learnt for entire pedestrians [14] and also for rigidly 
connected assemblies of sub-images [13]. Matching shape templates with the 
Chamfer distance has also been successfully used for pedestrian detection [1,5]. 
There is a healthy line of research that has developed human detectors based 
on an assembly of body parts. Forsyth and Fleck [4] introduced body plans for 
finding people in general configurations. Ioffe and Forsyth [6] then assembled 
body parts with projected classifiers or sampling. However, [4,6] rely on simplis- 
tic body part detectors - the parts are modelled as bar-shaped segments and 
pairs of parallel edges are extracted. This body part detector fails in the presence 
of clutter and loose clothing. Similarly, Felzenszwalb and Huttenlocher [2] show
that dynamic programming can be used to group body plans efficiently, but
simplistic colour-based part detectors are applied. An improvement on body part
detection is given in Ronfard et al. [17] where SVMs are trained for each body 
part. An improvement on the modelling of body part relations is given in Sigal 
et al. [21], where these are represented by a conditional probability distribution. 
However, these relations are defined in 3D, and multiple simultaneous images 
are required for detection. 

In this paper we present a robust approach to part detection and combine 
parts with a joint probabilistic body model. The parts include a larger local con- 
text [7] than in previous part-based work [4,17] and they therefore capture more 
characteristic features. They are however sufficiently local (cf. previous work on 
pedestrian detectors [14]) to allow for occlusion as well as for the detection of 
close-up views. We introduce new features which represent the shape better than 
the Haar wavelets [14], yet are simple enough to be efficiently computed. Our 
approach has been inspired by recent progress in feature extraction [10,18,19, 
20], learning classifiers [15,22] and joint probabilistic modelling [3]. 

Our contribution is threefold. Firstly, we have developed a robust part
detector. The detector is robust to partial occlusion due to the use of local features.
The features are local orientations of gradient and Laplacian based filters. The 
spatial layout of the features, together with their probabilistic co-occurrence, 
captures the appearance of the part and its distinctiveness. Furthermore, the 
features with the highest occurrence and co-occurrence probabilities are learnt 
using AdaBoost. The resulting part detector gives face detection results compa- 
rable to state of the art detectors [8,22] and is sufficiently general to successfully 
deal with other body parts. Secondly, the human detection results are signifi- 
cantly improved by computing a likelihood score for the assembly of body parts. 
The score takes into account the appearance of the parts and their relative po- 
sition. Thirdly, the approach is very efficient since (i) all part detectors use the 
same initial features, (ii) a coarse-to-fine cascade approach successively reduces 
the search space, (iii) an assembly strategy reduces the number of spurious de- 
tections. 

The paper is structured as follows. We introduce the body model in sec- 
tion 2. We then present the robust part detector in section 3, and the detection 
algorithm in section 4. Experimental results are given in section 5. 



2 Body Model 



In this section we give an overview of the body model, which is a probabilistic
assembly of a set of body parts. The joint likelihood model which assembles these
parts is described in section 2.1. The body parts used in the model are given in
section 2.2, and the geometric relations between the parts in section 2.3.
2.1 Joint Likelihood Body Model

Our classification decision is based on two types of observations, which
correspond to the body part appearance and the relative positions of the parts. The
appearance is represented by features $\mathcal{F}$ and the body part relations by
geometric parameters $\mathcal{R}$. The form of a Bayesian decision for body $B$ is:

\[
\frac{p(B \mid \mathcal{R}, \mathcal{F})}{p(\text{non } B \mid \mathcal{R}, \mathcal{F})}
= \frac{p(\mathcal{R} \mid \mathcal{F}, B)}{p(\mathcal{R} \mid \mathcal{F}, \text{non } B)}
\cdot \frac{p(\mathcal{F} \mid B)}{p(\mathcal{F} \mid \text{non } B)}
\cdot \frac{p(B)}{p(\text{non } B)}
\tag{1}
\]

The first term of this expression is the probability ratio that body parts are
related by geometric parameters measured from the image. The second term is
the probability ratio that the observed set of features $\mathcal{F}$ belongs to a body:

\[
\frac{p(\mathcal{F} \mid B)}{p(\mathcal{F} \mid \text{non } B)}
= \prod_{f \in \mathcal{F}} \frac{p(f, x_f \mid B)}{p(f, x_f \mid \text{non } B)}
\]

This set consists of a number of local features $f$ and their locations $x_f$ in a
local coordinate system attached to the body. The third term of (1) is a prior
probability of body and non-body occurrence in images. It is usually assumed
constant and used to control the false alarm rate.

Individual body part detectors are based on appearance (features and their 
locations) and provide a set of candidates for body parts. This is discussed in 
section 3. Given a set of candidate parts the probability of the assembly (or a 
sub-assembly) is computed according to (1). For example, suppose that a head H 
is detected based on the appearance, i.e. $p(\mathcal{F} \mid H)/p(\mathcal{F} \mid \text{non } H)$ is above threshold,
then the probability that an upper body (U) is present can be computed from the
joint likelihood of the upper-body/head sub-assembly $p(U, H)$. Moreover, a joint
likelihood can be computed for more than two parts. In this way we can build 
a body structure by starting with one part and adding the confidence provided 
by other body part detectors. Implementation details are given in section 4. 

2.2 Body Parts 

In the current implementation we use 7 different body parts as shown in Figure 1.
There are separate parts for a frontal head (a bounding rectangle which includes
the hair), and for the face alone. Similarly, there is a profile head part and a
profile face part.

Fig. 1. Body parts. (a) Frontal head and face (inner frame). (b) Profile head and
face (inner frame). (c) Frontal upper body. (d) Profile upper body. (e) Legs.

K. Mikolajczyk, C. Schmid, and A. Zisserman

Each body part is detected separately, as described in section 3 based on its 
likelihood ratio. 



2.3 Body Geometric Relations 

The probability of a false positive for an individual detector is higher than
for several detectors with a constraint on geometric relations between parts.
The geometric relationship between the parts is here represented by a Gaussian
$G(x_1 - x_2, y_1 - y_2, \sigma_1/\sigma_2)$ depending on their relative position and relative scale;
$\sigma_1$ and $\sigma_2$ correspond to the scales (sizes) at which the two body parts are detected.
These parameters are learnt from training data. The size of a human head can 
vary with respect to the eyes/mouth distance. Similarly, the scale and the rela- 
tive location between other body parts can vary for people. Figure 2(b) shows 
the Gaussian estimated for the head location with respect to the face location. 
Figure 2(c-d) shows the geometric relations for other body parts. We need to 
estimate only one Gaussian relation between two body parts, since the Gaussian 
function in the inverse direction can be obtained by appropriately inverting the 
parameters. Note that each of the detectors allows for some variation in pose. 
For example, the legs training data covers different possible appearances of the
lower body part.




Fig. 2. Gaussian geometric relations between body parts. (a) Frontal face location. (b)
Frontal head location with respect to the face location. (c) Profile location. (d) Profile
head location with respect to the profile location. (e) Profile upper body location with
respect to the head. (f) Frontal upper body location with respect to the head location,
and legs with respect to the upper body.



3 Body Part Detector 

In this section we present the detection approach for individual body parts. In 
sections 3.1 and 3.2 we describe the low-level features and the object represen- 
tation. Section 3.3 explains the classifiers obtained from the features and the 
learning algorithm. 
3.1 Orientation Features

An object’s appearance is represented by orientation-based features and local 
groupings of these features. This choice is motivated by the excellent performance 
of SIFT descriptors [10,12] which are local histograms of gradient orientations. 
SIFT descriptors are robust to small translation and rotation, and this is built 
into our approach in a similar way. 

Orientation features. Our features are the dominant orientation over a neigh- 
bourhood and are computed at different scales. Here we use 5 scale levels and a 
3-by-3 neighbourhood. Orientation is either based on first or second derivatives. 

In the case of first derivatives, we extract the gradient orientation. This ori- 
entation is quantized into 4 directions, corresponding to horizontal, vertical and 
two diagonal orientations. Note that we do not distinguish between positive and 
negative orientations. We then determine the score for each of the orientations 
using the gradient magnitude. The dominant direction is the one which obtains 
the best score. If the score is below a threshold, it is set to zero. Figure 3(b) 
shows the gradient image and Figure 3(c) displays the dominant gradient orien- 
tations where each of the 5 values is represented by a different gray-level value. 
Note the groups of dominant orientations on different parts of the objects. 
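
The selection of a dominant gradient orientation can be illustrated with a short sketch. It is a simplified reading of the paragraph above, not the authors’ implementation: the label numbering (0 for "no orientation", 1 to 4 for the four unsigned directions), the bin boundaries and the function names are assumptions, and the neighbourhood is passed in as a plain list of gradient vectors.

```python
import math

def quantize_orientation(gx, gy):
    """Map a gradient vector to one of 4 unsigned orientation labels (1..4).

    Opposite gradient directions fall into the same bin, since positive and
    negative orientations are not distinguished.
    """
    angle = math.atan2(gy, gx) % math.pi            # fold into [0, pi)
    return int((angle + math.pi / 8) // (math.pi / 4)) % 4 + 1

def dominant_orientation(gradients, threshold):
    """Pick the dominant orientation over a neighbourhood.

    gradients is a list of (gx, gy) pairs, e.g. 9 entries for a 3-by-3
    neighbourhood. Each orientation is scored by summed gradient magnitude.
    Returns 0 when the best score is below the threshold.
    """
    scores = [0.0] * 5                               # index 0 is unused
    for gx, gy in gradients:
        scores[quantize_orientation(gx, gy)] += math.hypot(gx, gy)
    best = max(range(1, 5), key=lambda k: scores[k])
    return best if scores[best] >= threshold else 0
```

With four orientation labels plus the zero label, each position takes one of 5 values, which matches the 5 gray levels mentioned for Figure 3(c).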

A human face can be represented at a very coarse image resolution as a col- 
lection of dark blobs. An excellent blob detector is the Laplacian operator [9]. 
We use this filter to detect complementary features like blobs and ridges. We 
compute the Laplacian ($d_{xx} + d_{yy}$) and the orientation of the second
derivatives ($\arctan(d_{yy}/d_{xx})$). We are interested in dark blobs, therefore we discard
the negative Laplacian responses, since they appear on bright blobs. Figure 3(d)
shows the positive Laplacian responses. Similarly to the gradient features we 
select the dominant orientation. Second derivatives are symmetrical therefore 
their responses on ridges of different diagonal orientations are the same. Conse- 
quently there are 3 possible orientations represented by this feature. Figure 3(e) 
displays the dominant second derivative orientations where each orientation is 
represented by a different gray-level value. 




Fig. 3. Orientation features. (a) Head image. (b) Gradient image. (c) Dominant
gradient orientations. (d) Positive Laplacian responses. (e) Dominant orientations of the
second derivatives.
Feature groups. Since a single orientation has a small discriminatory power, we
group neighbouring orientations into larger features. The technique described 
below was successfully applied to face detection [11,19]. We use two different 
combinations of local orientations. The first one combines 3 neighbouring ori- 
entations in a horizontal direction and the second one combines 3 orientations 
in a vertical direction. Figure 4(a) shows the triplets of orientations. A single 
integer value is assigned to each possible combination of 3 orientations. The
number of possible values is therefore $v_{max} = 5^3 = 125$ for the gradient and
$v_{max} = 4^3 = 64$ for the Laplacian. More than 3 orientations in a group
significantly increase the number of possible combinations and generalize poorly. In
summary, at a given scale there are four different feature group types $v_t$:
horizontal and vertical groups for gradient orientations and horizontal and vertical
groups for the Laplacian.
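
One simple way to assign a single integer to each triplet is to read the three labels as digits of a number in base 5 (gradient) or base 4 (Laplacian). The paper does not say which encoding was used, so the sketch below is only one possible choice; the function name and argument order are hypothetical.

```python
def encode_triplet(o1, o2, o3, n_values):
    """Encode three orientation labels, each in range(n_values), as one integer.

    The labels are treated as digits in base n_values, so the result lies in
    range(n_values ** 3): 125 values for n_values=5 and 64 for n_values=4.
    """
    for label in (o1, o2, o3):
        if not 0 <= label < n_values:
            raise ValueError("orientation label out of range")
    return (o1 * n_values + o2) * n_values + o3
```

Because the encoding is a bijection between triplets and integers, every combination of 3 orientations receives its own value and no value is wasted.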

3.2 Object Representation 

The location of a feature group on the object is very important as we expect 
a given orientation to appear more frequently at a particular location and less 
frequently at the other locations. The location is specified in a local coordinate 
system attached to the object (Figure 4(b)). To make the features robust to 
small shifts in location and to reduce the number of possible feature values we 
quantize the locations into a 5 x 5 grid (Figure 4(c)). 

In the following we will use the notation $(x, y, v_t)$ to refer to a feature group
of type $v_t$ at the grid location $(x, y)$. For simplicity we will refer to this as a
feature $(x, y, v_t)$.
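
The quantization of locations and the resulting feature $(x, y, v_t)$ can be sketched as follows. The window size of 20 pixels is taken from section 4; the way the grid cell and the group value are packed into one index is an assumption made for illustration, as are the function names.

```python
def grid_cell(px, py, window_size=20, grid_size=5):
    """Quantize a pixel position inside the window to a cell of the grid."""
    if not (0 <= px < window_size and 0 <= py < window_size):
        raise ValueError("position lies outside the window")
    cell = window_size / grid_size                   # 4 pixels per cell here
    return int(px // cell), int(py // cell)

def feature_index(px, py, value, v_max, window_size=20, grid_size=5):
    """Combine grid location and group value into a single table index.

    value is the integer assigned to the orientation group (0 <= value < v_max).
    The packing order is a hypothetical choice, not taken from the paper.
    """
    gx, gy = grid_cell(px, py, window_size, grid_size)
    return (gy * grid_size + gx) * v_max + value
```

With a 5 x 5 grid and 125 gradient group values, this gives 25 * 125 = 3125 distinct indices per feature type, small enough to be stored in a look-up table.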




Fig. 4. Local groups of features. (a) Two groups of local orientations. (b) Location of
the feature on the object. (c) Grid of quantized locations.



3.3 Classifiers 

To build a reliable detector we need a powerful classifier. Such classifiers can be 
formed by a linear combination of weak classifiers, and trained with a learning 
algorithm to excellent classification results at a small computational cost [8,22]. 
In the following we explain the form of our weak classifiers. 
Weak classifiers. The features described above are used to build a set of
classifiers. A weak classifier is the log likelihood ratio of the probability of feature
occurrence on the object with respect to the probability of feature occurrence
on the non-object:

\[
h_{f_a} = \ln \left( \frac{p(f_a \mid \text{object})}{p(f_a \mid \text{non object})} \right)
\]

where $f_a$ is a single feature $(x, y, v_t)$. Intuitively, some features occur frequently
together on the object but only randomly together on the non-object. Therefore, a better
weak classifier using the joint probability of two features is

\[
h_{f_{ab}} = \ln \left( \frac{p(f_a, f_b \mid \text{object})}{p(f_a, f_b \mid \text{non object})} \right)
\tag{2}
\]

where $f_a, f_b$ is a pair of features which simultaneously occur on the object.
The probabilities $p(f_a \mid \text{object})$ and $p(f_a, f_b \mid \text{object})$ and the corresponding
probabilities for non-object can be estimated using multidimensional histograms of
feature occurrences. Each bin in the histogram corresponds to one feature value.
The probabilities are estimated by counting the feature occurrence on positive
and negative examples. Some features do not appear at a particular object
location, which indicates a zero probability. To avoid a very large or infinite value
of a weak classifier we smooth the predictions as suggested in [15].
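
A weak classifier of this kind amounts to a table with one log likelihood ratio per histogram bin. The sketch below shows the single-feature case under simplifying assumptions: each training example contributes one feature value for the location and type under consideration, and smoothing is done by adding a small constant `eps` to both probabilities. The exact smoothing of [15] may differ; `eps` and the function name are hypothetical.

```python
import math
from collections import Counter

def weak_classifier_table(pos_values, neg_values, n_bins, eps=1e-3):
    """Build a look-up table of smoothed log likelihood ratios.

    pos_values / neg_values hold the feature value (an int in range(n_bins))
    observed on each positive / negative training example.
    """
    pos, neg = Counter(pos_values), Counter(neg_values)
    n_pos, n_neg = len(pos_values), len(neg_values)
    table = []
    for value in range(n_bins):
        p_object = pos[value] / n_pos                # histogram estimate
        p_non_object = neg[value] / n_neg
        # eps keeps the ratio finite when a bin is empty in either class.
        table.append(math.log((p_object + eps) / (p_non_object + eps)))
    return table
```

A bin that is frequent on the object and rare elsewhere gets a positive entry, the opposite case gets a negative entry, and a bin never seen in either class gets exactly zero. The pair classifier of equation (2) follows the same pattern with a histogram indexed by two feature values.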



Strong classifiers. A strong classifier is a linear combination of M weak classifiers

\[
H_M(x_i) = \sum_{m=0}^{M} h_{f_m}(x_i)
\]

where $x_i$ is an example and the class label is $\mathrm{sign}[H(x_i)]$. The weak classifiers
$h_{f_a}$ and $h_{f_{ab}}$ are combined using the real version of AdaBoost as proposed in [8,
15]. The error function used to evaluate the classifiers is

\[
E(H_M) = \sum_{i} \exp[-y_i H_M(x_i)]
\tag{3}
\]

where $y_i \in \{-1, 1\}$ is the class label for a given training example $x_i$.
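
The strong classifier and the error of equation (3) are straightforward to evaluate once the weak classifiers are available as look-up tables. The following sketch assumes that representation; it does not implement the AdaBoost selection and reweighting itself, and the data layout (one feature value per weak classifier and example) is a hypothetical simplification.

```python
import math

def strong_classifier(weak_tables, example):
    """Sum the responses of the weak classifiers for one example.

    weak_tables[m] is the look-up table of weak classifier m and example[m]
    is the feature value that classifier reads from the example.
    """
    return sum(table[value] for table, value in zip(weak_tables, example))

def exponential_error(weak_tables, examples, labels):
    """Error of equation (3): sum over examples of exp(-y * H(x))."""
    return sum(math.exp(-y * strong_classifier(weak_tables, x))
               for x, y in zip(examples, labels))
```

An example that is classified correctly with a large margin contributes a value close to zero, whereas a confidently wrong answer is penalized exponentially.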

A strong classifier is trained separately for each of the four feature types 
v t . This is motivated by the efficiency of the cascade approach. One feature 
type at one scale only has to be computed at a time for each cascade level. We 
compute features at different scales, therefore the number of strong classifiers 
is the number of feature types times the number of scales. The initial number 
of strong classifiers is therefore 20 (4 feature types at 5 scales). The number of 
weak classifiers used by AdaBoost depends on the scale of features and can vary 
from 16 to 5000. 



Cascade of classifiers. The strong classifiers are used to build a cascade of clas- 
sifiers for detection. The cascade starts with the best of the fastest strong classi- 
fiers. In this case the fastest classifiers are computed on the lowest scale level and 
the best one corresponds to that with the lowest classification error (equation 3).
Next, we evaluate all the pairs and the following classifier in the cascade is the 
one which leads to the best classification results. If the improvement is insignifi- 
cant we discard the classifier. The number of classifiers in a cascade is therefore 
different for each body part. The coarse-to-fine cascade strategy leads to a fast 
detection. The features are computed and evaluated for an input window only 
if the output of the previous classifier in the cascade is larger than a threshold. 
The thresholds are automatically chosen during training as the minimum
classifier responses on the positive training data. The output of each detector is a log
likelihood map given by the sum of all strong classifiers

\[
D(x_i) = \sum_{c=1}^{C} H_c(x_i)
\]

where C is the number of strong classifiers selected for the cascade. The location 
of the detected object is given by a local maximum in the log likelihood map. 
The actual value of the local maximum is used as a confidence measure for the 
detection. Note that the windows classified as an object have to be evaluated by 
all the classifiers in the cascade. The algorithm selected 8 strong classifiers out
of the initial 20 for each of the face detectors and 8 classifiers for each of the head
detectors (4 feature types at 2 scales). The upper body and legs detectors use
4 classifiers selected out of 20 (two feature types of gradient orientations at 2
scales).
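
The evaluation of a window by the cascade can be sketched as below. Stages are modelled as plain callables returning the response of one strong classifier, which is an assumption for illustration. Two further choices are ours rather than the paper’s: a response equal to the threshold is accepted (so that the weakest positive training example still passes), and the threshold test is also applied to the last stage.

```python
def choose_threshold(responses_on_positives):
    """Stage threshold: the minimum response on the positive training data."""
    return min(responses_on_positives)

def run_cascade(stages, thresholds, window):
    """Evaluate one window with a cascade of strong classifiers.

    stages is a list of callables, ordered from coarse to fine. A stage is
    only computed if every earlier stage reached its threshold. Returns the
    summed response D for accepted windows and None for rejected ones.
    """
    total = 0.0
    for stage, threshold in zip(stages, thresholds):
        score = stage(window)
        if score < threshold:
            return None          # rejected early, later stages are skipped
        total += score
    return total
```

Most background windows are discarded by the first, cheapest stages, so the finer and more expensive features are computed only for a small fraction of the windows. Accepted windows pass through all stages, and their summed response is the value entered in the log likelihood map.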

4 Detection System 

In this section we describe the detection system, that is how we find the in- 
dividual parts and how we assemble them. Detection proceeds in three stages: 
first, individual features are detected across the image at multiple scales; second, 
individual parts are detected based on these features; third, bodies are detected 
based on assemblies of these parts. 

Individual part detector. To deal with humans at different scales, the detection
starts by building a scale-space pyramid by sampling the input image with a
scale factor of 1.2. We then estimate the dominant orientations and compute the
groups of orientations as described in section 3.1. For the profile detection we 
compute a mirror feature representation. The estimated horizontal and vertical 
orientations remain the same, only the diagonal orientations for gradient features 
have to be inverted. Thus, for a relatively low computational cost we are able 
to use the same classifiers for left and right profile views. A window of a fixed 
size (20 x 20) is evaluated at each location and each scale level of the feature 
image. We incorporate the feature location within the window into the feature 
value. This is computed only once for all the part detectors, since we use the 
same grid of locations for all body parts. The feature value is used as an index 
in a look-up table of weights estimated by AdaBoost. Each look-up table of a 
body part corresponds to one strong classifier. The number of look-up tables
is therefore different for each body part detector. The output of the detector 
is a number of log likelihood maps corresponding to the number of body parts 
and scales. The local maxima of the log likelihoods indicate the candidates for 
a body part. To detect different parts individually we threshold the confidence 
measure. A threshold is associated with each part and is used to make the final 
classification decision. However, better results are obtained by combining the 
responses of different part detectors and then thresholding the joint likelihood. 



Joint body part detector. Given the locations and magnitudes of local maxima 
provided by individual detectors we use the likelihood model described in sec- 
tion 2.1 to combine the detection results. We start with a candidate detected 
with the highest confidence and larger than a threshold. This candidate is classi- 
fied as a body part. We search and evaluate the candidates in the neighbourhood 
given by the Gaussian model of geometric relations between two parts. 

For example, suppose that a head (H) is detected. This means that the log
likelihood ratio

\[
D_H = \ln \left( \frac{p(\mathcal{F} \mid H)}{p(\mathcal{F} \mid \text{non } H)} \right)
\]

is above threshold. We can then use the position $(x_H, y_H)$ and scale $\sigma_H$ of the
detected head to determine a confidence measure that there is an upper body
(U) at $(x, y)$ with scale $\sigma$. In detail, $G(x_H - x, y_H - y, \sigma_H/\sigma)$ is used to weight
the computed $D_U$ (where $D_U$ is defined in a similar manner to $D_H$ above). The
final score is

\[
D_{U|H}(x, y, \sigma) = D_U(x, y, \sigma) + G(x_H - x, y_H - y, \sigma_H/\sigma)\, D_H(x_H, y_H, \sigma_H)
\tag{4}
\]

and the upper body is detected if this score is above threshold.
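
Equation (4) can be illustrated numerically with the sketch below. The form of the Gaussian is an assumption: the paper only states that G depends on relative position and relative scale with parameters learnt from training data, so an unnormalised Gaussian with independent components and peak value 1 is used here. The means, standard deviations and function names are hypothetical.

```python
import math

def gaussian_weight(dx, dy, scale_ratio, mean, std):
    """Unnormalised Gaussian over (dx, dy, scale_ratio) with peak value 1.

    mean and std are 3-tuples; independent components are an assumption.
    """
    z = sum(((value - m) / s) ** 2
            for value, m, s in zip((dx, dy, scale_ratio), mean, std))
    return math.exp(-0.5 * z)

def joint_score(d_u, d_h, dx, dy, scale_ratio, mean, std):
    """Score of equation (4): D_U plus the Gaussian-weighted D_H.

    dx = x_H - x, dy = y_H - y and scale_ratio = sigma_H / sigma.
    """
    return d_u + gaussian_weight(dx, dy, scale_ratio, mean, std) * d_h
```

When the upper body candidate lies exactly where the learnt relation expects it, the full head confidence is added to its own score. A candidate far from the expected position receives almost no support from the head and is judged on its own response alone.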

A confidence measure can also be computed for more than two parts; e.g. for
an upper body, head and legs (L) sub-assembly, $D_{L|U,H} = D_L + G(\mathcal{R}_{L|U}) D_{U|H}$.
If this score is higher than a threshold we accept this candidate as the body part 
and remove the closely overlapping neighbours. We can set the decision thresh- 
old higher than for the individual detectors since the confidence for body part 
candidates is increased with the high confidence of the other body parts. Given 
the new body part location we continue searching for the next one. There are 
usually few candidates to evaluate in the neighbourhood given by the Gaussian 
model. 

The current implementation does not start to build the model from legs since 
this detector has obtained the largest classification error (cf. equation 3) and the 
legs are not allowed to be present alone for a body. In most of the body examples 
the highest log likelihood is obtained either by a face or by a head. 
5 Experiments

5.1 Training Data 

Each body part detector was trained separately on a different training set. Ap- 
proximately 800 faces were used to train the frontal face detector and 500 faces 
for the profile detector. The frontal views were aligned by eyes and mouth and the 
profiles by eyebrow and chin. For each face example we add 2 in-plane-rotation 
faces at -10 and 10 degrees. To train the frontal upper body/leg model we used 
250/300 images of the MIT pedestrian database [14]. 200 images for training
the profile upper body model were collected from the Internet. The initial
classifiers were trained on 100K negative examples obtained from 500 images. We
then selected for each body part 4000 non-object examples detected with initial 
classifiers. The selected examples were then used to retrain the classifiers with 
AdaBoost. 

5.2 Detection Results 

Face. The MIT-CMU test set is used to test the performance of our face de- 
tectors. There are 125 images with 481 frontal views and 208 images with 347 
profiles. The combined head-face models for frontal and profile faces were used in 
this test. Figure 5(a) shows the face detection results. The best results were ob- 
tained with the frontal face detector using a combination of simple features and 
feature pairs. We obtain a detection rate of 89% for only 65 false positives. These 
results are comparable with state of the art detectors (see figure 5(c)). They can 
be considered excellent given that the same approach/features are used for all 
human parts. Compared to the classifiers using only single features the gain is 
approximately 10%. A similar difference can be observed for the profile detectors. 
The performance of the profile detector is not as good as the frontal one. The 
distinctive features for profiles are located on the object boundaries, therefore 
the background has a large influence on the profile appearance. Moreover the 
test data contains many faces with half profile views and with in-plane rotation
of more than 30 degrees. Our detector uses a single model for profiles and
currently we do not explicitly deal with in-plane rotations. The detection rate of
75% with only 65 false positives is still good and is the only quantitative result 
reported on profile detection, apart from [19]. 

Human. To test the upper body and legs detector we use 400 images of the MIT 
pedestrian database which were not used in training. 200 images containing no 
pedestrians were used to estimate the false positive rate. There are 10800K
windows evaluated for the negative examples. The false positive rate is defined as the
number of false detections per inspected window. There are 10800K/200 = 54000
inspected windows per image. Figure 5(b) shows the detection results for the
head/face, the frontal view of the upper body part and legs as well as the joint 
upper body/legs model. The results for head/face are converted from figure 5(a) 
and displayed on 5(b) for comparison. The best results are obtained for frontal 
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Face detection results 




Human detection results 
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- 


- 


94.4% 


- 


our approach 


75% 


85% 


89% 


90% 



(c) 



Fig. 5. (a) ROC curves for face detectors. HF_F(f_a) are the results for the combined frontal 
head H and face F detector using single features f_a. HF_F(f_ab, f_a) are the results for the 
detector using both single features and feature pairs. Similarly for the profile detector 
HF_P. (b) ROC curves for head/face, upper body and legs detectors. The results for 
head/face are converted from HF(f_ab, f_a) displayed in figure (a). U are the results 
for the individual upper body detector and L for the individual legs detector. U|L are the 
results for the upper body detector combined with the legs detector. (c) Face detection results 
compared to state-of-the-art approaches. 



head/face with the joint model. The result for the upper body and the legs are 
similar. For a low false positive rate the joint upper-body/legs detector is about 
15% better than the individual upper-body and legs detectors. We obtain a de- 
tection rate of 87% with the false positive rate of 1:100000, which corresponds 
to one false positive per 1.8 images. This performance is better than the ones 
reported for pedestrian detection in [13,14]. Note that an exact comparison is 
not possible, since only the number of images selected for the training/test is 
given. In addition, our approach performs well for general configurations in the 
presence of occlusion and partial visibility, see figures 6 and 7. 
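
The conversion between the false positive rate per window and the number of images per false positive can be checked with a few lines of arithmetic. The following is a minimal sketch using only the figures quoted above; the variable names are illustrative and not taken from the original implementation.

```python
# Figures quoted in the text: 10800K windows evaluated over 200 negative images.
total_windows = 10_800_000
negative_images = 200

# Inspected windows per image.
windows_per_image = total_windows / negative_images

# False positive rate of 1:100000 per inspected window.
false_positive_rate = 1 / 100_000

# Expected false positives per image, and images per false positive.
false_positives_per_image = windows_per_image * false_positive_rate
images_per_false_positive = 1 / false_positives_per_image

print(windows_per_image)            # windows inspected in each image
print(false_positives_per_image)    # expected false detections per image
print(images_per_false_positive)    # images needed, on average, for one false detection
```

Under these numbers the rate corresponds to roughly one false positive every 1.8 to 1.9 images, which agrees with the value stated in the text.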

Figure 6 illustrates the gain obtained by the joint likelihood model. The 
top row shows the results of the individual detectors and the bottom row the 
combined results. The improvement can be observed clearly. The false positives 
disappear and the uncertain detections are correctly classified. Some other ex- 
amples are shown in Figure 7. 

Fig. 6. Results for human detection. Top row: individual body part detection. Bottom 
row: detection with the joint likelihood model. The joint likelihood model significantly 
improves the detection results. 




Fig. 7. Human detection with the joint model. Top row: images from movies “Run 
Lola Run” and “Groundhog Day” . Bottom row: images from MIT-CMU database. 


6 Conclusions 

In this paper we have presented a human detector based on a probabilistic as- 
sembly of robust part detectors. The key point of our approach is the robust 
part detector which takes into account recent advances in feature extraction and 
classification, and uses local context. Our features are distinctive due to encoded 
orientations of first and second derivatives and are robust to small translations 
in location and scale. They efficiently capture the shape and can therefore be 
used to represent any object. The joint probabilities of feature co-occurrence are 
used to improve the feature representation. AdaBoost learning automatically 
selects the best single and pairs of features. The joint likelihood of body parts 
further improves the results. Furthermore, our approach is efficient, as we use 
the same features for all parts and a coarse-to-fine cascade of classifiers. 
The multi-scale evaluation of a 640 x 480 image takes less than 10 seconds on a 
2GHz P4 machine. 

A possible extension is to include more part detectors, for example an arm 
model. We also plan to learn more than one lower body detector. If the training 
examples are too different, the appearance cannot be captured by the same 
model. We should then automatically divide the training images into subsets and 
learn a detector for each subset. Furthermore, we can use motion consistency 
in a video to improve the detection performance in the manner of [11]. 



Acknowledgements. Funding for this work was provided by an INRIA post- 


Abstract. In the present paper, we address the problem of recovering the true 
underlying model of a surface while performing the segmentation. A novel 
criterion for surface (model) selection is introduced and its performance for 
selecting the underlying model of various surfaces has been tested and 
compared with many other existing techniques. Using this criterion, we then 
present a range data segmentation algorithm capable of segmenting complex 
objects with planar and curved surfaces. The algorithm simultaneously 
identifies the type (order and geometric shape) of surface and separates all the 
points that are part of that surface from the rest in a range image. The paper 
includes the segmentation results of a large collection of range images obtained 
from objects with planar and curved surfaces. 



1 Introduction 

Model selection has received substantial attention in the last three decades due to its 
various applications in statistics, engineering and science. During this time, many 
model selection criteria (Table 1) have been proposed, almost all of which have their 
roots in statistical analysis of the measured data. In this paper we propose a new 
approach to the model selection problem based on physical constraints rather than 
statistical characteristics. Our approach is motivated by our observations that none of 
the existing model selection criteria is capable of reliably recovering the underlying 
model of range data of curved objects (see Figure 1). 

Before we explain our model selection criterion, the problem of range 
segmentation for non-planar objects is briefly reviewed in the next section. Later, we 
propose a novel model selection tool called the Surface Selection Criterion (SSC). This 
proposed criterion is based on the minimisation of the bending and twisting energy of 
a thin surface. To demonstrate the effectiveness of our proposed SSC, we devised a 
robust model-based range segmentation algorithm for curved objects (not limited to 
planar surfaces). The proposed Surface Selection Criterion allows us to choose the 
appropriate surface model from a library of models. An important aspect of having a 
correct model is obtaining the surface parameters while segmenting the surface. 
Recovering the underlying model is a crucial aspect of segmentation when the objects 
are not limited to having planar surfaces only (so more than one possible candidate 
exists). 

Table 1. Different model selection criteria studied in this paper. N is the number of points and 
P is the number of parameters. J is the Fisher information matrix of the estimated parameters. L is equal to 
N. d is the dimension of the manifold (here 2) and m is the dimension of the data (here 3). f is the 
degrees of freedom of the assumed t distribution for MCAIC (here 1.5). 



Name              Criterion

MDL [24]          $\sum_{i=1}^{N} r_i^2 + (P/2)\log(N)\,\delta^2$
GBIC [6]          $\sum_{i=1}^{N} r_i^2 + (N d \log(4) + P \log(4N))\,\delta^2$
CP Kanatani [15]  $\sum_{i=1}^{N} r_i^2 + (2(dN + P) - mN)\,\delta^2$
CP Mallow [19]    $\sum_{i=1}^{N} r_i^2 + (-N + 2P)\,\delta^2$
GAIC [16]         $\sum_{i=1}^{N} r_i^2 + 2(dN + P)\,\delta^2$
SSD [25]          $\sum_{i=1}^{N} r_i^2 + (P \log((N + 2)/24) + 2\log(P + 1))\,\delta^2$
CAIC [5]          $\sum_{i=1}^{N} r_i^2 + P(\log N + 1)\,\delta^2$
CAICF [5]         $\sum_{i=1}^{N} r_i^2 + P(\log N + 2)\,\delta^2 + \log|J|$
GMDL [16]         $\sum_{i=1}^{N} r_i^2 - (Nd + P)\,\delta^2 \log(\delta/L)^2$
MCAIC [4]         $(1 + f)\sum_{i=1}^{N} \log\left(1 + \frac{r_i^2}{f\,\delta^2}\right) + P(\log N + 1)\,\delta^2$
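
To make the structure of these criteria concrete, the sketch below evaluates those entries of Table 1 that consist of the sum of squared residuals plus a penalty proportional to the squared noise scale. It is only an illustration: the function name and argument names are hypothetical, and the criteria that need extra quantities (the Fisher matrix J, the length L, or the t-distribution parameter f) are left out.

```python
import math

def penalised_scores(sum_sq_res, n_points, n_params, delta2, d=2, m=3):
    """Evaluate several Table 1 criteria for one candidate model.

    sum_sq_res : sum of squared residuals of the fitted model
    n_points   : N, number of data points
    n_params   : P, number of model parameters
    delta2     : squared scale of noise
    d, m       : manifold and data dimensions (2 and 3 for range data)
    """
    N, P = n_points, n_params
    return {
        "MDL": sum_sq_res + (P / 2) * math.log(N) * delta2,
        "GBIC": sum_sq_res + (N * d * math.log(4) + P * math.log(4 * N)) * delta2,
        "CP Kanatani": sum_sq_res + (2 * (d * N + P) - m * N) * delta2,
        "CP Mallow": sum_sq_res + (-N + 2 * P) * delta2,
        "GAIC": sum_sq_res + 2 * (d * N + P) * delta2,
        "CAIC": sum_sq_res + P * (math.log(N) + 1) * delta2,
    }

# Illustrative values only: 100 points, a 3-parameter plane, noise variance 0.01.
scores = penalised_scores(sum_sq_res=1.0, n_points=100, n_params=3, delta2=0.01)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

For model selection, each criterion would be evaluated for every model in the library and the model with the smallest value chosen.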




Fig. 1. Comparison of various model selection criteria for synthetic and real range data 


2 Range Segmentation 

A range image contains 3D information about a scene including the depth of each 
pixel. Segmenting a range image is the task of dividing the image into regions so that 
all the points of the same surface belong to the same region, there is no overlap 
between different regions, and the union of these regions generates the entire image. 

There have been two main approaches to address the range segmentation problem. 
The first one is the region-based approach, in which one groups data points so that all 
of them can be regarded as members of a parametric plane ([18] or [3]). The other 
approach is based on edge detection and on labelling edges using the jump edges 
(discontinuities), for example [8] or [14]. 

Although the range image segmentation problem has been studied for a number of 
years, the task of segmenting range images of curved surfaces is yet to be 
satisfactorily resolved. The comparative survey of Powell et al. [23] indeed reveals 
the challenges that need to be addressed. An early work in the area of segmentation of 
curved surfaces was published in 1994 by Boyer et al. [4]. They used a modified 
version of Bozdogan's CAIC [5] as a model selection criterion. Later, Besl and Jain 
[3] proposed a range image segmentation algorithm to segment range images of 
curved objects. The performance of their proposed algorithm for segmenting curved 
surfaces has been reported (by Powell et al. [23]) as unsatisfactory in many cases. 
Bab-Hadiashar and Suter [2] have also proposed a segmentation algorithm, which was 
capable of segmenting a range image of curved objects. Although they managed to 
segment range images into regions expressed by a specific quadratic surface, their 
method was limited, as it could not distinguish between different types of quadratic 
surfaces. Although there have been other attempts to segment range data using higher 
order surfaces [11,20,22,28], there have been few range segmentation algorithms 
capable of successfully segmenting curved objects and recovering the order of those 
segments. A complete literature review on range segmentation is beyond the scope of 
this paper. A comprehensive survey and comparison of different range segmentation 
algorithms was reported by Hoover et al. [13], while a survey on model-based object 
recognition for range images has been reported by Arman and Aggarwal [1]. 

In this paper, we propose a range segmentation algorithm that is capable of 
identifying the order of the underlying surface while calculating the parameters of the 
surface. To accomplish this, we propose a new model selection criterion called 
Surface Selection Criterion to recover the correct surface model while segmenting the 
range data. An important aspect of our segmentation algorithm is that it can handle 
occlusion properly (as described in step 5 of the segmentation algorithm). 
To evaluate and compare the performance of our algorithm with other existing range 
segmentation algorithms, we have first tested the proposed algorithm on the ABW 
image database [12]. The results of these experiments are shown in the experimental 
results section. Since the ABW database does not contain images of curved objects, we then 
created a range image database of a number of objects possessing both planar and 
curved surfaces. The results of those experiments (shown in Figure 1) confirm that 
our algorithm is not only capable of segmenting planar surfaces but can also segment 
higher order surfaces correctly. 


3 Model Selection 

In this section we propose a Surface Selection Criterion to identify the appropriate 
model from a family of models representing possible surfaces of a curved object. Our 
proposed criterion is based on minimising the sum of bending and twisting energies of 
all possible surfaces. Although the bending energy of a surface has been used in the 
literature for motion tracking and finding parameters of deformable objects ([29], [30], 
[7]) and also in shape context matching and active contours ([17]), it has not been used 
for model selection purposes. 



3.1 Our Proposed Surface Selection Criterion 

Our proposed criterion is based on minimising the sum of bending and twisting 
energy of all possible surfaces in a model library. To formulate our model selection 
criterion, we view the range data of different points of an object as hypothetical 
springs constraining the surface. If the surface has little stiffness, then the surface 
passes close to measurements (fits itself to the noise) and the sum of squared residuals 
between the range measurements and their associated points on the surface will be 
small (the sum of squared residuals in this analogy, relates to the energy of the 
deformed springs). However, to attain such proximity, the surface has to bend and 
twist in order to be close to the measured data. This in turn increases the amount of 
strain energy accumulated by the surface. For model selection, we propose to view the 
sum of bending and twisting energies of the surface as a measure of surface roughness 
and the sum of squared residuals as a measure of fidelity to the true data. A good 
model selection criterion should therefore represent an acceptable compromise 
between these two factors. As one may expect, increasing the number of parameters 
of a surface leads to larger bending and twisting energies, as the surface has more 
degrees of freedom and can consequently be fitted to the data by bending 
and twisting itself so that a closer fit to the measured data results (this can be inferred 
from the bending energy formula, Equation 1). However, the higher the number of 
parameters assumed for a surface model, the smaller the sum of squared residuals is 
going to be. For instance, in the extreme case, if the number of parameters is equal to 
the number of data points (which are used in the fitting process), then the sum of 
squared residuals will be zero whereas the sum of energies will be maximised. 

We have a conjecture as to why this approach should be advantageous. Common 
statistical methods that rely essentially on the probability distribution of residuals ignore 
the spatial distribution of the deviations of data points from the surface, whereas the 
above method intrinsically (through a physical model) couples the local spatial 
distribution of residuals to the strain energy in that locality. We argue that this is an 
important point as the range measurements are affected by localised factors (such as 
surrounding texture, surface specularities, etc.) as well as by the overall accuracy and 
repeatability of the rangefinders. 

As shown in [27], if a plate is bent by a uniformly distributed bending moment so 
that the xy and yz planes are the principal planes of the deflected surface, then the 
strain energy (for bending and twisting) of the plate can be expressed as: 
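
$$E_{Bending+Twist} = \frac{D}{2} \iint \left\{ \left( \frac{\partial^2 w}{\partial x^2} + \frac{\partial^2 w}{\partial y^2} \right)^2 - 2(1 - \nu) \left[ \frac{\partial^2 w}{\partial x^2} \frac{\partial^2 w}{\partial y^2} - \left( \frac{\partial^2 w}{\partial x \, \partial y} \right)^2 \right] \right\} dx \, dy \qquad \text{(Equation 1)}$$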

where D is the flexural rigidity of the surface and ν is Poisson's ratio (ν should be 
very small because in real-world objects the twisting energy is small in comparison with the 
bending energy). In our experiments we assume ν = 0.01. We found in our 
experiments that the performance of SSC is not overly sensitive to small variations 
of this value. In order to scale the strain energy, we divide its value by the strain 
energy of the model with the highest number of parameters (E_max). Therefore, D will 
be eliminated from our computation. 

To capture the trade-off between the sum of squared residuals $\sum_{i=1}^{N} r_i^2$ and the strain 
energy $E_{Bending+Twist}$, we define a function SSC such that: 

$$SSC = \frac{\sum_{i=1}^{N} r_i^2}{N \delta^2} + P \, \frac{E_{Bending+Twist}}{E_{max}}$$

where δ is the scale of noise for the highest surface (the surface with the highest 
number of parameters). The reason that we use the scale of noise for the highest 
surface (as explained by Kanatani [15]) is that the scale of noise for the correct model 
and the scale of noise of the higher order models (higher than the correct model) must 
be close for the fitting to be meaningful. Therefore, it is the best estimation of the true 
scale of noise which is available at this stage. The energy term has been multiplied by 
the number of parameters P in order to discourage choosing a higher order (than 
necessary) model. Such a simple measure produces good discrimination and improves 
the accuracy of the model selection criterion. Having devised a reasonable 
compromise between fidelity to data and the complexity of the model, our model 
selection task is then reduced to choosing the surface that has the minimum value of 
SSC. To evaluate our proposed Surface Selection Criterion and compare it with other 
well known model selection criteria (Table 1), we first created five synthetic data sets 
according to the surface models in Surface Library 1 (one for each model) and 
randomly changed the parameters of each data set 1000 times. We also added 10% 
normally distributed noise. The success rates of all methods in correctly recovering 
the underlying model are shown in Figure 1. To consider more realistic surfaces, we 
then considered a more comprehensive set of surface models (shown in Surface 
Library 2) and repeated the above experiments. The percentages of successes in this 
case are also shown in Figure 1. 
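
The selection procedure can be illustrated with a small sketch for the explicit models z = f(x, y) of Surface Library 1. The sketch rests on several assumptions that are not taken from the paper: the function names are hypothetical, ordinary least squares is used for fitting, and the scale of noise is estimated as the root mean squared residual of the highest model rather than with the robust estimator of [2]. Because the second derivatives of these quadratic models are constant, the integral in Equation 1 reduces to a constant area factor that cancels, together with D, in the ratio to E_max.

```python
import numpy as np

NU = 0.01  # Poisson's ratio assumed in the text

# Surface Library 1: each model is a list of monomials x^px * y^py.
LIBRARY = {
    "Model 1": [(2, 0), (0, 2), (1, 1), (1, 0), (0, 1), (0, 0)],
    "Model 2": [(2, 0), (1, 1), (1, 0), (0, 1), (0, 0)],
    "Model 3": [(2, 0), (0, 2), (1, 0), (0, 1), (0, 0)],
    "Model 4": [(1, 1), (1, 0), (0, 1), (0, 0)],
    "Model 5": [(1, 0), (0, 1), (0, 0)],
}

def fit(terms, x, y, z):
    """Least-squares fit of z to the given monomials; returns coefficients and residuals."""
    A = np.column_stack([x**px * y**py for px, py in terms])
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return dict(zip(terms, coef)), z - A @ coef

def strain_energy(coef):
    """Integrand of Equation 1 (constant for quadratic models; D and area dropped)."""
    wxx = 2.0 * coef.get((2, 0), 0.0)
    wyy = 2.0 * coef.get((0, 2), 0.0)
    wxy = coef.get((1, 1), 0.0)
    return (wxx + wyy) ** 2 - 2.0 * (1.0 - NU) * (wxx * wyy - wxy**2)

def select_model(x, y, z, highest="Model 1"):
    """Return the name of the model with minimum SSC and all SSC values."""
    fits = {name: fit(terms, x, y, z) for name, terms in LIBRARY.items()}
    n = len(z)
    coef_hi, res_hi = fits[highest]
    delta2 = np.sum(res_hi**2) / (n - len(coef_hi))  # assumed noise-scale estimate
    e_max = strain_energy(coef_hi)
    scores = {}
    for name, (coef, res) in fits.items():
        energy_ratio = strain_energy(coef) / e_max if e_max > 0 else 0.0
        scores[name] = np.sum(res**2) / (n * delta2) + len(coef) * energy_ratio
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 400)
y = rng.uniform(-1, 1, 400)
z = 0.5 * x**2 + 0.3 * y**2 + 0.2 * x - 0.1 * y + 0.5 + rng.normal(0, 0.01, 400)
best, scores = select_model(x, y, z)
print(best)
```

In this toy setting the residual term penalises models that are too simple, while the energy term, weighted by P, penalises models with more parameters than the data require.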

Finally, to examine the success rate of our Surface Selection Criterion and compare 
it with other selection techniques on real range images, we randomly hand-picked 
points of 100 planar surfaces of the objects in the ABW range image database [12] and 
also 48 curved (quadratic) surfaces of our range image database. In this case, the 
model library used is the Surface Library 3. The results are also shown in Figure 1. 

As can be seen from Figure 1, the proposed criterion (SSC) is considerably better 
in choosing the right model when it is applied to a variety of real range data. We 
should note here that the performance of MCAIC [4] is expected to be slightly better 
than what we have reported here if the segmentation framework reported in [4] is 
used (here, we only examined the selection capability of the criteria). 

Surface Library 1 

Model 1    $z = ax^2 + by^2 + cxy + dx + ey + f$
Model 2    $z = ax^2 + bxy + cx + dy + e$
Model 3    $z = ax^2 + by^2 + cx + dy + e$
Model 4    $z = axy + bx + cy + d$
Model 5    $z = ax + by + c$



Surface Library 2. (It should be noted that bending energy is shift and rotation invariant. 
Therefore there is no need to add more models to this library that have other possible 
combinations of x, y and z.) 

Model 1    $ax^2 + by^2 + cz^2 + dxy + eyz + fxz + gx + hy + iz = 1$
Model 2    $ax^2 + by^2 + cxy + dyz + exz + fx + gy + hz = 1$
Model 3    $ax^2 + by^2 + cz^2 + dx + ey + fz = 1$
Model 4    $axy + byz + cxz + dx + ey + fz = 1$
Model 5    $ax^2 + by^2 + cz^2 + dxy + eyz + fxz = 1$
Model 6    $ax^2 + by^2 + cx + dy + ez = 1$
Model 7    $ax + by + cz = 1$



Surface Library 3 

Model 1    $ax^2 + by^2 + cz^2 + dxy + eyz + fxz + gx + hy + iz = 1$
Model 2    $ax^2 + by^2 + cz^2 + dx + ey + fz = 1$
Model 3    $ax + by + cz = 1$



However, to improve the efficiency of our proposed Surface Selection Criterion, 
we can carry out some post processing, provided that we have a set of nested models 
in the model library (like Surface Library 2). 

That is, if the sum of squares of the non-common terms between the higher surface and 
the next lower surface is less than a threshold, we select the lower surface. This 
simple step also improves the already high success rate of our proposed SSC. 
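
One possible reading of this post-processing step is sketched below. It assumes that "non-common terms" refers to the fitted coefficients of the monomials that appear in the higher model but not in the next lower one; the function name, the string labels for monomials and the threshold value are all hypothetical, since the text does not specify them.

```python
def prefer_lower_model(coef_high, terms_low, threshold):
    """Return True if the lower (nested) model should replace the higher one.

    coef_high : dict mapping each monomial of the higher model to its fitted coefficient
    terms_low : collection of monomials present in the next lower model
    threshold : assumed, application-dependent bound on the sum of squares
    """
    extra = [c for term, c in coef_high.items() if term not in terms_low]
    return sum(c * c for c in extra) < threshold

# A quadratic fit whose curvature coefficients are negligible: fall back to the plane.
fitted = {"x^2": 0.001, "y^2": -0.002, "x": 0.2, "y": -0.1, "1": 0.5}
plane_terms = {"x", "y", "1"}
print(prefer_lower_model(fitted, plane_terms, threshold=1e-4))
```

The check only makes sense for nested libraries, where every term of the lower model also appears in the higher one.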



4 Segmentation Algorithm 

Having found a reliable method for recovering the underlying model of a higher order 
surface, we then proceed to use this method to perform the range segmentation of 
curved objects. Since our segmentation algorithm requires an estimate of the scale of 
noise, we have implemented the method presented in [2]. 



4.1 Model Based Range Segmentation Algorithm 

In this section, we briefly but precisely explain the steps of the proposed range 
segmentation algorithm. The statistical justification of each step is beyond the scope 
of this paper, but it suffices to mention that the algorithm has been closely modelled 
on the ones presented in [21,26] for calculating the least median of squares. The 
proposed algorithm combines the above noise estimation technique and our model 
selection criterion SSC and delivers an effective means for segmenting not only planar 
objects (as can be seen from our experiments with ABW range data) but also curved 
objects containing higher order models. This algorithm has been extensively tested on 
several range data images with considerable success (presented in the experimental 
results). The required steps are as follows: 

1. Eliminate pixels whose associated depths are not valid due to the limitation of the 
range finder used for measuring the depth (mainly due to specularities, poor texture, 
etc). These points are usually marked by the range scanner with an out-of-range 
number. If there are no such points we can skip this stage. 

2. Find a localised data group inside the data space in which all the pixels appear on a 
flat plane. Even if there is no planar surface in the image, we can always approximate 
a very small local area (here 15x15) as a planar surface. To implement this stage and 
find such a data group, we choose a number of random points, which all belong to the 
same square of size R (this square is only for the sake of local sampling). Using these 
points, create an over-determined linear equation system. If the number of inliers is 
more than half of the size of the square, then, mark this square as an acceptable data 
group. The size of the square (R) is not important, however it needs to be large 
enough to contain adequate sample points. We set the square size as 15x15 in our 
experiments. We have chosen this size because a square of size 15x15 can contain 
enough samples. In our experiments, 30 samples were used to perform the above step. 

3. Fit the highest model in the library to all the accepted data groups and find the 
residual for each point. Then, repeat the above two steps and accept the data group 
that has the least K-th order residual (the choice of K depends on the application [2] and 
is set to 10% for our experiments). This algorithm is not sensitive to the value of K. 
However, if we assume K to be very large, small structures will be ignored. 

4. Apply a model selection method (here SSC) to the extended region (by fitting and 
comparing all models in the model library to the extended region) and find the 
appropriate model. 

5. Fit the chosen model to the whole data (not segmented parts); compute the 
residuals and estimate the scale of noise using the technique explained in the previous 
section. In the next step this scale will help us to reject the outliers. It is important to 
note that performing this step has the advantage that it can also remedy the occlusion 
problem if there is any. This means that if a surface of an object is divided - occluded 
- by another object, we can then rightly join the separated parts as one segment. 

6. Establish a group of inliers based on the obtained scale and reject the outliers. We 
reject those points whose squared residual is greater than T^2 times the squared 
scale of noise (see the inequality $r_{n+1}^2 > T^2 \delta^2$ in the previous section). Then, 
recalculate the residuals and compute the final scale using: 

$$\delta^2 = \sum_{i=1}^{N} r_i^2 / (N - P)$$

7. Apply a hole-filling algorithm (here, we use a median filter of 10 by 10 pixels) to 
all inliers and remove holes resulting from invalid and noisy points (points where the 
range finder has not been able to correctly measure the depth, mainly due to their 
surface texture). This step is only for the sake of the appearance of the results and has 
no effect on the segmented surface's parameters because the fitting has already been 
performed. However, some of the missed invalid and noisy points can be grouped in 
this step. This step is not essential and can be skipped if desired. 

8. Eliminate the segmented part from the data. 

9. Repeat steps 1 to 8 until the number of remaining data points becomes less than the 
size of the smallest possible region in the considered application. 
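
The inlier selection and final scale computation of step 6 can be sketched as follows for the simplest case of a planar model z = ax + by + c. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the threshold T = 2.5 is an assumed value (the text does not give it in this section), and the initial coefficients and scale are taken as given from the preceding steps.

```python
import numpy as np

def fit_plane(x, y, z):
    """Least-squares fit of z = a*x + b*y + c; returns (a, b, c)."""
    A = np.column_stack([x, y, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coef

def plane_residuals(coef, x, y, z):
    """Signed residuals of the points with respect to the plane."""
    return z - (coef[0] * x + coef[1] * y + coef[2])

def reject_and_rescale(coef, scale, x, y, z, T=2.5):
    """Keep points with r^2 <= T^2 * scale^2, refit, and recompute the scale."""
    r = plane_residuals(coef, x, y, z)
    inliers = r**2 <= (T * scale) ** 2
    coef_in = fit_plane(x[inliers], y[inliers], z[inliers])
    r_in = plane_residuals(coef_in, x[inliers], y[inliers], z[inliers])
    # Final scale: sqrt(sum r_i^2 / (N - P)) over the inliers, with P = 3 for a plane.
    final_scale = np.sqrt(np.sum(r_in**2) / (inliers.sum() - len(coef_in)))
    return inliers, coef_in, final_scale

# Synthetic example: 300 points on one plane, 100 points on a parallel plane 1.0 above it.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 400)
y = rng.uniform(-1, 1, 400)
z = 0.2 * x - 0.1 * y + 0.5 + rng.normal(0, 0.01, 400)
z[300:] += 1.0
inliers, coef, final_scale = reject_and_rescale(np.array([0.2, -0.1, 0.5]), 0.01, x, y, z)
print(inliers.sum(), coef, final_scale)
```

Because the chosen model is fitted to all remaining data before this test, points of the same surface that are spatially separated by an occluding object pass the same residual test and end up in one segment.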



5 Experimental Results on Segmentation 

To evaluate the performance of the proposed algorithm, we have conducted a 
comprehensive set of experiments using real range images of various objects. The 
first set of experiments is solely for comparison purposes and is performed on the 
existing ABW database that only includes objects with planar surfaces. It is shown 
that the proposed technique can accurately segment the above database and its 
performance is similar to the best techniques presented in the literature [13]. We have 
then applied our technique to a set of real range images with objects having a 
combination of planar and curved surfaces. By these experiments we have shown that 
the present technique is not only capable of segmenting these objects correctly, but 
also truly identifies the underlying model of each surface. 



5.1 ABW Image Database 

In the first set of our experiments, we applied our algorithm to the ABW [12] range 
image database and compared our results with the ones reported by Hoover et al. [13]. 
As is shown here, the proposed technique is able to segment all of the images 
correctly. Less than 1% of over-segmentation has occurred, which is in turn resolved 
by using a simple merging (post-processing) step. To show the performance of our 
algorithm in estimating angles and comparing it with the results obtained by Hoover 
et al. [13], we randomly chose 100 surfaces and calculated the absolute difference 
between the real angle (calculated using the IDEAS CAD package) and the computed 
angle using the parameters of the segmented surface. The average and the standard 
deviation of the error for our technique and others reported in the literature are shown 
in Table 2. A few of the results of segmenting the ABW range image database are 
shown in Fig. 2. 
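
The angle computed from the parameters of two segmented planar surfaces can be obtained from their normals. The sketch below assumes planes written as ax + by + cz = 1, as in the surface libraries, so that (a, b, c) is the normal vector, and it takes the angle between the two normals as the angle between the surfaces; the function names are illustrative only.

```python
import numpy as np

def angle_between_planes(plane_1, plane_2):
    """Angle in degrees between the normals of two planes a*x + b*y + c*z = 1."""
    n1 = np.asarray(plane_1, dtype=float)
    n2 = np.asarray(plane_2, dtype=float)
    cos_angle = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def angle_error(plane_1, plane_2, true_angle_deg):
    """Absolute difference between the computed and the reference angle."""
    return abs(angle_between_planes(plane_1, plane_2) - true_angle_deg)

# Two perpendicular faces of a box, with a small perturbation on one of them.
print(angle_between_planes((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
print(angle_error((1.0, 0.02, 0.0), (0.0, 1.0, 0.0), 90.0))
```

Averaging such absolute differences over a set of surface pairs gives a mean and standard deviation of the kind reported in Table 2.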



Table 2. Comparison of accuracy of estimated angles 



Technique             Angle diff. (std dev.)

USF [10]              1.6 (0.8)
WSU [11]              1.6 (0.7)
UB [14]               1.3 (0.8)
UE [9,28]             1.6 (0.9)
Proposed algorithm    1.4 (0.9)


Fig. 2. Left: intensity image. Right: segmentation result 



5.2 Curved Objects Database 

To evaluate the performance of our algorithm in segmenting range images of curved 
objects, we created a range image database of a number of objects possessing both 
planar and curved surfaces. The actual data and their segmented results are shown in 
the following figures (Figure 3 to 9). We use a comprehensive model library (Surface 
Library 4), which consists of the most concise possible model for each object in the 
scene. For example, because we have cylinders (or parts of cylinders) perpendicular to the xy 
plane in our objects, the model $ax^2 + by^2 + cx + dy = 1$ is included in the model 
library (Surface Library 4). 

Our segmentation algorithm was able to correctly identify the underlying model for 
each surface using Surface Library 4 as our model library, and SSC as a model 
selection criterion. The following figures show the results of our experiments. In all 
of these figures, the labels show the underlying detected model. The algorithm 
successfully labelled all surfaces. For example, surfaces 11, 14, 15, 16 and 4 in Fig. 
3 and Fig. 4, which are cylinders perpendicular to the xy plane, are identified to have 
the underlying Model 5. The underlying model for surface 25 in Fig. 6 was chosen to 
be Model 3, which is a cylinder parallel to the xy plane. Therefore our method not 
only can detect the cylindrical shape of the surface but is also able to distinguish the 
direction of cylinders (detecting the degeneracy). For all flat surfaces SSC correctly 
selects Model 8, which represents a flat plane. An advantage of our range 
segmentation algorithm over region growing range segmentation algorithms is the 
way in which it deals with occlusion or separation of parts. Our method can detect 
and solve such problems correctly, as can be seen from Fig. 6. In this example the 
planar object is located between two cylinders of the same size whose axes are 
collinear. The proposed algorithm correctly detects the existence of such issues. 

Surface Library 4 

Model 1    $ax^2 + by^2 + cz^2 + dx + ey + fz = 1$
Model 2    $ax^2 + by^2 + cx + dy + eyx = 1$
Model 3    $ax^2 + bz^2 + cx + dz + exz = 1$
Model 4    $az^2 + by^2 + cz + dx + fxz = 1$
Model 5    $ax^2 + by^2 + cx + dy = 1$
Model 6    $ax^2 + bz^2 + cx + dz = 1$
Model 7    $ay^2 + bz^2 + cy + dz = 1$
Model 8    $ax + by + cz = 1$



Fig. 3. Intensity image (left), plotted range data (middle) and segmentation result (right). SSC 
selects model 5 for the cylinders perpendicular to the xy plane (surface 16 and surface 15). The 
chosen model for the flat surface 17 is model 8. Surface 17, which has two separated planar 
parts, is an example of occlusion. 




Fig. 4. Intensity image (left), plotted range data (middle) and segmentation result (right). The 
cylinders perpendicular to the xy plane are detected correctly (model 5) using SSC. The black 
region illustrates the missed data. The roofs of the three cylinders are at the same height and have 
been correctly segmented by the proposed algorithm. 




Fig. 5. Intensity image (left), plotted range data (middle) and segmentation result (right). The 
underlying model for surface 13, which is a cylinder perpendicular to the xy plane, is selected to 
be model 5. For planar surfaces SSC selects model 8 as the underlying model. As can be seen 
from the plotted range image, the segmentation is performed despite noisy and invalid 
data. 




Fig. 6. Intensity image (left), plotted range data (middle) and segmentation result (right). SSC 
selected model 3 for the cylinders parallel to the xy plane. The underlying surface model for 
planar surfaces 24 and 23 and also other planar surfaces in the scene is chosen to be model 8. 


6 Conclusion 

In this paper, we have proposed and evaluated a new surface model selection 
criterion. Using this criterion, we have also developed a robust model-based range 
segmentation algorithm, which is capable of distinguishing between different types of 
surfaces while segmenting the objects. The proposed techniques, both for model 
selection and for range segmentation, have been extensively tested and have been 
compared with a wide range of existing techniques. The proposed criterion for model 
selection and the resulting segmentation algorithm clearly outperform previously 
reported techniques. 


Abstract. For structured-light range imaging, color stripes can be used for 
increasing the number of distinguishable light patterns compared to binary BW 
stripes. Therefore, an appropriate use of color patterns can reduce the number of 
light projections, and range imaging is achievable in a single video frame or in 
“one shot”. On the other hand, the reliability and range resolution attainable 
from color stripes is generally lower than those from multiply projected binary 
BW patterns since color contrast is affected by object color reflectance and 
ambient light. This paper presents new methods for selecting stripe colors and 
designing multiple-stripe patterns for “one-shot” and “two-shot” imaging. We 
show that maximizing color contrast between the stripes in one-shot imaging 
reduces the ambiguities resulting from colored object surfaces and limitations in 
sensor/projector resolution. Two-shot imaging adds an extra video frame and 
maximizes the color contrast between the first and second video frames to 
diminish the ambiguities even further. Experimental results demonstrate the 
effectiveness of the presented one-shot and two-shot color-stripe imaging 
schemes. 



1 Introduction 

Triangulation-based structured lighting is one of the most popular ways of active 
range sensing, and various approaches have been suggested and tested. Recently, 
interest has developed in rapid range sensing of moving objects such as cloth, 
human faces and bodies in one or slightly more video frames, and much attention has 
been paid to the use of color to increase the number of distinguishable patterns in an 
effort to decrease the number of structured-light projections required for ranging a 
scene. This paper discusses the design of stripe patterns and the selection of colors to 
assign to stripe illumination patterns that minimizes the effects of object surface 
colors, system noise, nonlinearity and limitations in camera/projector resolution for 
real-time range imaging in a single video frame (“one-shot”) and for near real-time 
imaging in double video frames (“two-shot”). 

Among the systems that employ a single illumination source (projector) and a 
single camera, those that project a sweeping laser light plane, black-and-white (BW) 
stripe patterns, or gray-level stripes have been well investigated [1] [2] [3]. Since they 
project multiple light patterns or a sweeping laser stripe, they are appropriate for 
stationary scenes. Hall-Holt and Rusinkiewicz have recently suggested a method 
based on time-varying binary BW patterns for near real-time ranging in four video 
frames, but object motion is assumed to be slow to keep “time coherence” [4]. 

Various approaches have been made to one-shot or near one-shot imaging. One 
class of methods is those that project continuous light patterns: Tajima and Iwakawa 
used rainbow pattern with continuous change of light color [5], Huang et al. used 
sinusoidal color fringe light with continuous variation of phase [6], and Carrihill and 
Hummel used gray-level ramp and constant illumination [7]. Although all these 
methods can, in principle, produce range images with high speed and high resolution 
only restricted by system resolution, they are highly susceptible to system noise, 
nonlinearity and object surface colors. Another class of approaches includes those that 
use discrete color patterns: Davis and Nixon designed a color-dot illumination pattern, 
Boyer and Kak developed color stripe patterns that can be identified by a color coding 
of adjacent stripes, and Zhang et al. developed a color stripe pattern based on a 
de Bruijn sequence in which stripes are identified by dynamic programming [8] [9] [10]. 

Although the color-dot and color-stripe methods are less sensitive to system noise 
and nonlinearity compared to those with continuous light patterns, they are also 
significantly affected by object color reflectance and their range resolution is limited 
by stripe width. Caspi et al. presented a three-image-frame method that can overcome 
the ambiguity in stripe labeling due to object surface color, but its real-time 
application has not been explicitly considered [11]. 

Most of the color stripe-based methods suggest design of color patterns that can be 
uniquely identified in the illumination space, but little explicit attention has been paid 
to the selection of colors. In this paper, we investigate the selection of colors for 
illumination stripe patterns for maximizing range resolution, and present a novel two- 
shot imaging method which is insensitive to system noise, nonlinearity and object 
color reflectances. 

The rest of this paper is organized as follows. Section 2 describes a design of color 
multiple-stripe patterns, Section 3 discusses selection of colors for one-shot imaging, 
and Section 4 presents a method for two-shot imaging. In Section 5, generation and 
identification of multiple stripe patterns are discussed. Section 6 presents the 
experimental results and Section 7 concludes this paper. 



2 Multiple-Stripe Patterns for Structured Light 

A typical triangulation-based ranging system with structured light consists of an LCD 
or DLP projector and a camera, as illustrated in Fig. 1. Design of a good light pattern is 
critical for establishing reliable correspondences between the projected light and camera. For 
real-time (30 Hz) one-shot imaging, only one illumination pattern is used and fixed in 
time. For two-shot imaging, on the other hand, the two light patterns should alternate 
in time and the projector and camera should be synchronized. Since projectors and 
cameras with frame rates higher than 60 Hz are now common commercially, near 
real-time imaging with two shots becomes much more feasible than before. In this 
section, we describe methods for selecting stripe patterns and colors for one-shot and 
two-shot imaging. 
Fig. 1. Triangulation-based structured-light range imaging 



The most straightforward way of generating unique color labels for M stripes 
would be to assign M different colors. In this case, the color distances between the 
stripes are small and this simple scheme can be as sensitive to system noise and 
nonlinearity as the rainbow pattern [5]. A small number of colors with substantial 
color differences are more desirable in this regard, but global uniqueness is hard to 
achieve due to the repeated appearance of a color among M stripes. This problem has 
been addressed and spatially windowed uniqueness has been investigated [9] [10]. 
Instead of using one stripe for identification, we can use a small number (e.g., k) of 
adjacent stripes such that each subsequence of k consecutive stripes is unique within the 
entire stripe sequence. 

It can be easily shown that from N different colors, $N^k$ different stripe sequences 
of length k can be made. When adjacent stripes are forced to have different 
colors, the number of uniquely identified sequences is [9]: 

$$n(N,k) = N(N-1)^{k-1} \tag{1}$$

If a single stripe is used with N = 3 colors, for instance, only 3 stripe labels are 
attainable, but the number of uniquely identifiable labels increases to 6, 12, 24 and 48 
for k = 2, 3, 4 and 5. The entire pattern should be generated such that any sub-pattern 
around a stripe can be uniquely labeled. 
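
As a quick check of Equation 1, the following minimal Python sketch compares the closed form $N(N-1)^{k-1}$ with a brute-force count of all length-k color sequences in which adjacent stripes differ. The function names are illustrative only and do not come from the paper.

```python
from itertools import product


def count_by_formula(n_colors, k):
    """Closed form of Equation 1: N * (N - 1)^(k - 1)."""
    return n_colors * (n_colors - 1) ** (k - 1)


def count_by_enumeration(n_colors, k):
    """Count length-k sequences whose adjacent stripes have different colors."""
    return sum(
        all(a != b for a, b in zip(seq, seq[1:]))
        for seq in product(range(n_colors), repeat=k)
    )


for k in range(1, 6):
    # Columns: k, closed-form count, brute-force count (N = 3 colors)
    print(k, count_by_formula(3, k), count_by_enumeration(3, k))
```

For N = 3 both counts give 3, 6, 12, 24 and 48 for k = 1 to 5, in agreement with the values quoted above.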

It may be noted that with a binary BW pattern (N = 2) it is impossible to increase 
the number of distinct labels in this way of generating subsequences since 
$n(2,k) = 2(2-1)^{k-1} = 2$. In other words, for one-shot imaging, the number of 
identifiable sub-patterns remains fixed regardless of the length k. Hall-Holt and 
Rusinkiewicz used multiple frames in time for increasing it with binary BW 
stripes [4]. 

If the color subpatterns are simply concatenated, not every stripe is uniquely 
identifiable [9]. We prefer having the entire stripe pattern designed such that the 
windows of subpatterns overlap but every stripe can be identified by a unique 
subpattern of k consecutive stripes centered at the stripe. Zhang et al. have employed 
this scheme using the de Bruijn sequence [10]. 

The design of the stripe pattern requires several choices: the total number of 
stripes M, the length of the subpattern k and the number of colors N. In what follows, 
we turn to a discussion of criteria for choosing the stripe colors for high-resolution 
range imaging. 
Fig. 3. Intensity profiles of RGB stripes 



3 Color Selection for One-Shot Imaging 

For high-resolution ranging, the stripes should be as narrow as possible. For a given 
resolution limit of a projector-camera pair, stripes can be better detected with 
higher color contrast between the stripes. The most common choice of colors has been 
among R, G, B, cyan (C), yellow (Y), magenta (M), black, and white. For 
stripes that appear as thin as 1.5-2 pixels in the camera, the use of some of these colors 
confounds color stripe detection. 

Let us first consider the case of using only three deeply saturated primary colors R, 
G and B, and consider the addition of other colors later. Fig. 2 shows a chromaticity 
space (CIE-xy space). The colors that can be represented by the system RGB filter primaries are 
limited to the triangle shown in Fig. 2. For the sake of simplicity in discussion, we 
assume that the filter characteristics of projector and camera are identical. The image 
irradiance in the camera $I(\lambda)$ can be represented as: 

$$I(\lambda) = g_\theta S(\lambda) E(\lambda), \tag{2}$$

where $S(\lambda)$ is the object reflectance, $E(\lambda)$ is the illumination from the projector, and $g_\theta$ 
is the geometric shading factor determined by the surface orientation and illumination 
angle. When the object reflectances $S(\lambda)$ are neutral (Fig. 4(a)), the received colors are 
determined by the projector illumination $E(\lambda)$, i.e., RGB stripe illumination, which 



Fig. 4. Chromaticity diagram: (a) neutral object reflectances, (b) dispersed chromaticities under 
RGB stripe illumination and (c) dispersed chromaticities under RGBCMY stripe illumination 




Fig. 5. Chromaticity diagram: (a) colored object reflectances, (b) dispersed chromaticities under 
RGB stripe illumination and (c) dispersed chromaticities under RGBCMY stripe illumination 



splits the reflections from the neutral surfaces to the RGB points, i.e., the vertices of 
the chromaticity triangle (Fig. 4(b)). The one-dimensional hue value can be used 
effectively for separating the distinct colors. The more saturated the three 
stripe-illumination colors are, the farther apart the image stripe chromaticities are and 
therefore the easier they are to distinguish from each other. 

For contiguous RGB stripes, colors other than RGB appear in the image as 
linear combinations of the RGB colors due to the limited bandwidth of the 
camera/projector system, as illustrated in Fig. 3. The linear combinations of RG, GB 
and BR colors are generated at the boundaries. Fig. 4(a) shows neutral reflectance in 
the chromaticity triangle, and 4(b) shows the spread of the chromaticities at the 
boundaries. With only RGB illumination colors, thresholding by hue is effective in 
the presence of false boundary colors as illustrated in Fig. 4(b). Hue thresholding also 
works well for the reflections from moderately colored object surfaces as illustrated in 
Fig. 5. Fig. 5(a) and (b) depict the chromaticities of colored objects and spread 
chromaticities under RGB stripe illumination, respectively. 
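
The hue thresholding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the decision boundaries (60, 180 and 300 degrees, midway between the R, G and B hues), the `min_saturation` cutoff and the function name are our own choices and are not specified in the paper.

```python
import colorsys


def classify_stripe_color(r, g, b, min_saturation=0.2):
    """Label a pixel as 'R', 'G' or 'B' by hue, or None if nearly colorless.

    r, g, b are floats in [0, 1]. The thresholds are assumed values placed
    midway between the hues of the three primaries (0, 120 and 240 degrees).
    """
    hue, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
    if saturation < min_saturation:
        return None  # too close to neutral to decide
    hue_deg = hue * 360.0
    if hue_deg < 60.0 or hue_deg >= 300.0:
        return "R"
    if hue_deg < 180.0:
        return "G"
    return "B"


# A false boundary color between an R and a G stripe still falls on one side
# of the hue threshold and is absorbed into a neighboring stripe label.
print(classify_stripe_color(0.9, 0.1, 0.1))  # saturated red stripe
print(classify_stripe_color(0.6, 0.4, 0.0))  # R/G boundary mixture
print(classify_stripe_color(0.5, 0.5, 0.5))  # neutral pixel, no label
```

With six illumination colors (RGBCMY) the same hue axis would have to be split into six narrower bins, which is one way to see why boundary mixtures become harder to separate from true stripe colors.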

When the stripe width is 1.5-2 pixels, the number of pixels with false colors around the 
stripe boundaries is substantial compared to the number with stripe colors. When additional colors 
such as CMY are used, the false boundary colors significantly confound stripe 
detection and identification since the CMY colors are linear combinations of the RGB 
primaries. The false colors can easily break subpatterns. Fig. 4(c) and 5(c) depict the 
chromaticity spreads under RGBCMY illumination from neutral and colored objects, 
respectively. It can be seen that additional false colors appear and there is no easy 
way to separate them from the stripe colors. 
We find that for high-resolution imaging, the use of only RGB colors results in the 
best resolution and the smallest errors. This restricts the number of color codes N to 3. 
White can be added if surface shading is insignificant and black might be added when 
ambient light is low. 

Other 3-color primaries such as CMY would perform similarly to RGB only if the 
chromaticity distances of the primaries were as large as those of RGB. However, the 
color distances between CMY (as synthesized from the linear combinations of RGB) 
are substantially smaller. (See Fig. 2.) The CMY colors in Fig. 2 have substantially 
less saturation than RGB. The only way to keep the CMY distances comparable to 
those of RGB is to employ narrowband CMY color filters in the camera instead of 
RGB, but such commercial cameras are rare in reality. 

If object surfaces have substantially saturated colors, the reflectance chromaticities 
are dispersed so widely that even strong RGB stripes cannot separate them for 
detection and extra information is needed. 



4 Two-Shot Imaging Method 



When a synchronized high-speed camera and projector are available, more than one 
frame can be used to discount the effects of highly saturated object colors and 
ambient illumination in range sensing. Many commercial cameras and projectors offer 
an external trigger and frame refresh rates higher than 60 Hz. We present a two-shot 
imaging method that uses two video frames and stripes with highly saturated projector 
colors. 

When projection of two light colors $E_1(\lambda)$ and $E_2(\lambda)$ alternates in time, the 
following two images can be obtained: 

$$\begin{cases} I_1(\lambda) = g_\theta S(\lambda)[E_1(\lambda) + A(\lambda)] \\ I_2(\lambda) = g_\theta S(\lambda)[E_2(\lambda) + A(\lambda)] \end{cases} \tag{3}$$

where the effect of the ambient illumination $A(\lambda)$ is included. Since it is assumed that objects 
are stationary during two consecutive video frames in a short time interval, $g_\theta$ and 
$S(\lambda)$ are common to both images. 

Caspi et al. used an extra image with the projector turned off to estimate the 
influence of the ambient illumination from $I_A(\lambda) = g_\theta S(\lambda) A(\lambda)$ [11]. After discounting 
$A(\lambda)$, the ratio of the images depends only on the designed illumination 
colors without any influence of surface color and shading, i.e.: 

$$\frac{I_2(\lambda) - I_A(\lambda)}{I_1(\lambda) - I_A(\lambda)} = \frac{E_2(\lambda)}{E_1(\lambda)}. \tag{4}$$



With the commonly used assumption of spectral smoothness of $S(\lambda)$ in each color 
channel and some appropriate color calibration described in [11], the responses in the 
color channels can be decoupled and analyzed independently with the following 
ratios: 

$$\frac{R_2}{R_1}, \quad \frac{G_2}{G_1} \quad \text{and} \quad \frac{B_2}{B_1}.$$
While this is an effective way of discounting object colors if image values are in 
the linear range of projector and camera responses and many combinations of color 
ratios can be produced for stripes, the ratios become unstable when $I_1$ and $I_2$ are small 
due to small $g_\theta$ and $S(\lambda)$ values and when image values are clipped on highly 
reflective surfaces. The geometric factor $g_\theta$ is small on shaded surfaces whose surface 
orientation is away from the illumination angle. A “three-shot” method by Caspi et al. 
uses an extra shot to measure the ambient illumination $I_A(\lambda)$ for its compensation and 
relies on the color ratios [11]. However, the color ratios are highly unstable for bright 
and dark regions. 

Instead of assigning many RGB projector colors and identifying them in the linear 
range, we seek a small number of stripe signatures from the sign of the color difference to 
reduce the sensitivity to light intensity and ambient light without the third image for 
the estimation of ambient light. From Equation 3, the difference of the two images is 
given as: 

$$\Delta I(\lambda) = I_2(\lambda) - I_1(\lambda) = g_\theta S(\lambda)[E_2(\lambda) - E_1(\lambda)] = g_\theta S(\lambda)\Delta E(\lambda), \tag{5}$$

and the ambient illumination is discounted. In Equation 5, it can be seen that though 
$\Delta I(\lambda)$ is significantly affected by $g_\theta$ and $S(\lambda)$, its sign is not since $g_\theta$ and $S(\lambda)$ are both 
positive. We can derive a few stripe labels from the sign of the image difference. 

When color channels are decoupled with the same color assumption described 
above, the differences of RGB channel values are: 

$$\begin{cases} \Delta R = g_\theta S^R \Delta E^R \\ \Delta G = g_\theta S^G \Delta E^G \\ \Delta B = g_\theta S^B \Delta E^B \end{cases} \tag{6}$$

where $S^R$, $S^G$ and $S^B$ are the object reflectances in the R, G and B channels, respectively, 
and $\Delta E^R$, $\Delta E^G$ and $\Delta E^B$ are the intensity differences of projection illumination in the R, G 
and B channels, respectively. From the positive and negative signs of $\Delta R$, $\Delta G$ and $\Delta B$, 
we can obtain $2^3 = 8$ distinct codes to assign to a projector stripe. If we construct 
subpatterns with N = 8, $n(N,k) = 8 \cdot 7^{k-1}$ unique subpatterns can be generated according 
to Equation 1. In the design of subpatterns, however, it was observed that spatial 
transitions of colors over the stripes in multiple channels produce false colors due to the 
inconsistency of color channels and those false colors are not easily identifiable. If we 
allow a spatial color transition only in one of the RGB channels, only three types of 
transitions are allowed from one stripe to another. In this case, it can be shown that 
the number of unique subpatterns of length k in two-shot imaging, “n2”, is given as: 

$$n2(m,k) = 2^m m^{k-1}, \tag{7}$$

where m is the number of possible spatial color transitions, e.g., m = 3 for RGB colors 
and m = 2 for GB colors. To maximize the $\Delta R$, $\Delta G$ and $\Delta B$ values for good 
discriminability between the positive and negative signs, the intensity difference of 
projection color should be maximized; we assign the minimum and maximum 
programmable values for frames 1 and 2. 
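
The count in Equation 7 can be checked numerically. The sketch below is our own illustration, not the authors' program: it represents each stripe code as a tuple of m sign bits, enumerates all length-k code sequences, and keeps those in which exactly one channel changes between adjacent stripes.

```python
from itertools import product


def n2_formula(m, k):
    """Closed form of Equation 7: 2^m * m^(k - 1)."""
    return 2 ** m * m ** (k - 1)


def n2_enumeration(m, k):
    """Brute-force count of length-k sequences of m-bit sign codes in which
    adjacent codes differ in exactly one channel."""
    codes = list(product((0, 1), repeat=m))

    def one_channel_changes(a, b):
        return sum(x != y for x, y in zip(a, b)) == 1

    return sum(
        all(one_channel_changes(a, b) for a, b in zip(seq, seq[1:]))
        for seq in product(codes, repeat=k)
    )


print(n2_formula(2, 7), n2_enumeration(2, 7))  # GB channels, k = 7
print(n2_formula(3, 3), n2_enumeration(3, 3))  # RGB channels, k = 3
```

The first stripe can take any of the $2^m$ codes and each following stripe has m choices (one per channel that may flip), which is where the factor $m^{k-1}$ comes from.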






C. Je, S.W. Lee, and R.-H. Park 



The channel value differences in Equation 6 are also affected by small values of $g_\theta$ 
and $S(\lambda)$, but we claim that their influence is much smaller with the difference than 
with the ratio for given system noise. Furthermore, the effect of system nonlinearity and 
RGB color clipping is much less pronounced with the signs of the difference. To 
demonstrate the efficacy of this approach, we use the presented method 
with only two color channels, G and B, in our experiments, i.e., with 
$N = 2^2 = 4$ codes. 
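
To make the sign-based coding concrete, the following Python sketch simulates two GB frames with a toy camera model based on Equation 3 and decodes each stripe code from the sign of the per-channel difference. The 8-bit range, the reflectance, shading and ambient values, and all function names are assumptions chosen for illustration.

```python
LOW, HIGH = 0, 255  # assumed minimum and maximum programmable projector values


def frames_for_code(code):
    """Map sign bits, e.g. (g_bit, b_bit), to projector values for frames 1 and 2.

    Bit 1 means the channel goes LOW -> HIGH (positive difference);
    bit 0 means HIGH -> LOW. Frame 2 is the per-channel reverse of frame 1.
    """
    frame1 = tuple(LOW if bit else HIGH for bit in code)
    frame2 = tuple(HIGH if bit else LOW for bit in code)
    return frame1, frame2


def observe(frame, reflectance, shading, ambient, clip=255.0):
    """Toy camera response g * S * (E + A) per channel, clipped at `clip`."""
    return tuple(min(clip, shading * s * (e + ambient))
                 for e, s in zip(frame, reflectance))


def decode(image1, image2):
    """Recover the sign bits from the difference image2 - image1."""
    return tuple(int(v2 > v1) for v1, v2 in zip(image1, image2))


for code in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    f1, f2 = frames_for_code(code)
    # A shaded, strongly colored surface (high G, low B reflectance).
    i1 = observe(f1, reflectance=(0.9, 0.1), shading=0.3, ambient=40)
    i2 = observe(f2, reflectance=(0.9, 0.1), shading=0.3, ambient=40)
    print(code, decode(i1, i2))
```

In this toy model the decoded bits match the projected code even though the two channels differ in brightness by almost an order of magnitude, because only the sign of each difference is used.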



5 Multiple-Stripe Pattern Synthesis and Identification 

The requirements for a program that generates the subpatterns are as follows: 

(a) Different colors or codes should be assigned to adjacent stripes to make 
distinguishable stripes. 

(b) The subpattern generated by the ith stripe should be different from any 
subpatterns generated up to the (i-1)th stripe. 

For the two-shot imaging, the following extra requirements should be added: 

(c) Only one channel among RGB should make a color transition between the 
adjacent stripes. 

(d) Colors in each stripe in the second frame should be the reverse of those in the 
first frame in each channel. The reverse of maximum value is the minimum 
value and vice versa. 
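
The paper does not describe the generating program itself. One possible way to satisfy requirements (a) to (c) is to treat every valid window of k-1 stripes as a graph node and every valid window of k stripes as an edge, and then walk an Eulerian circuit with Hierholzer's algorithm. The Python sketch below follows that approach; the function name, the `allowed` predicate and the code representations are our own assumptions.

```python
def generate_stripe_sequence(codes, allowed, k):
    """Return a stripe sequence in which every window of k consecutive
    stripes is unique and each adjacent pair (a, b) satisfies allowed(a, b).

    Assumes k >= 2 and that the transition graph is Eulerian, which holds
    for the two predicates used below.
    """
    if k < 2:
        raise ValueError("this sketch assumes k >= 2")
    successors = {a: [b for b in codes if allowed(a, b)] for a in codes}
    # Build any valid starting window of k - 1 stripes.
    start = [codes[0]]
    while len(start) < k - 1:
        start.append(successors[start[-1]][0])
    # Hierholzer's algorithm: nodes are (k-1)-windows, edges are k-windows.
    unused = {}
    stack, circuit = [tuple(start)], []
    while stack:
        node = stack[-1]
        if node not in unused:
            unused[node] = list(successors[node[-1]])
        if unused[node]:
            stack.append(node[1:] + (unused[node].pop(),))
        else:
            circuit.append(stack.pop())
    circuit.reverse()
    return list(circuit[0]) + [node[-1] for node in circuit[1:]]


# One-shot: three colors, adjacent stripes must differ (requirement (a)).
one_shot = generate_stripe_sequence("RGB", lambda a, b: a != b, 7)

# Two-shot with GB channels: codes are sign bits, and exactly one channel
# may change between adjacent stripes (requirement (c)).
gb_codes = [(g, b) for g in (0, 1) for b in (0, 1)]
two_shot = generate_stripe_sequence(
    gb_codes, lambda a, b: sum(x != y for x, y in zip(a, b)) == 1, 7)

print(len(one_shot), len(two_shot))
```

Each returned sequence contains k-1 more stripes than it has unique windows. For requirement (d), the second frame would then be obtained by reversing every channel value of the first frame.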

There is a tradeoff to make between the total number of stripes needed to cover a scene, M, 
and the number of stripes in a whole pattern, n(N, k). For high-resolution imaging M 
should be kept large and n(N, k) should be close to M for a unique encoding of all the 
stripes. Otherwise, the whole pattern should be repeated for the scene, and its 
subpatterns are not globally unique. This means that for small N, the length of the 
subpattern k should be large. Wide subpatterns, however, are not reliable near 
object boundaries and occlusions. The best compromise we make for one-shot 
imaging (“n1”) is to have n1(3, 7) = 192 with k = 7 for M = 400 and let the pattern appear 
twice. Stripe identification or unwrapping is not difficult since identical subpatterns 
appear only twice and they are far apart in the image. For two-shot imaging with 
only GB channels, $n2(2, 7) = 2^2 \cdot 2^{7-1} = 256$ with k = 7. The generated patterns for 
one-shot and two-shot imaging are shown in Figs. 6 and 7. 




Fig. 6. RGB pattern for one-shot imaging: 192 unique subpatterns 

Fig. 7. GB patterns for two-shot imaging: 256 unique subpatterns in each frame 



Stripe segmentation in the received image can be done mostly based on two 
methods: stripe color classification and edge detection by color gradients. Both should 
work equivalently for imaging with thick stripes. For high frequency stripe imaging, 
however, color gradients are not stably detected after smoothing and we prefer stripe 
color classification. 
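
Once stripe colors have been classified, identification reduces to looking up each window of k consecutive labels in a table built from the projected pattern. The Python sketch below illustrates this with a short toy pattern and k = 3; the function names and the toy pattern are our own, and it assumes an odd k so that each window has a central stripe.

```python
def build_window_index(pattern, k):
    """Map each k-stripe window of the projected pattern to the index of
    its central stripe. Raises if a window occurs more than once."""
    table = {}
    for i in range(len(pattern) - k + 1):
        window = tuple(pattern[i:i + k])
        if window in table:
            raise ValueError("window %r is not unique" % (window,))
        table[window] = i + k // 2
    return table


def identify_stripes(observed, table, k):
    """Label each observed stripe with its index in the projected pattern,
    or None where the surrounding window is unknown or incomplete."""
    labels = [None] * len(observed)
    for i in range(len(observed) - k + 1):
        window = tuple(observed[i:i + k])
        if window in table:
            labels[i + k // 2] = table[window]
    return labels


table = build_window_index("RGBRBG", k=3)
print(identify_stripes("GBRB", table, k=3))  # [None, 2, 3, None]
print(identify_stripes("GBGB", table, k=3))  # [None, None, None, None]
```

The second call shows how a single misclassified stripe breaks every window that contains it, which is consistent with the remark above that wide subpatterns are less reliable near object boundaries and occlusions.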



6 Experimental Results 

We have carried out experiments with a number of objects using the described one- 
shot and two-shot methods. We used a Sony XC-003 3-CCD 640x480 color camera, 
and an Epson EMP-7700 1024x768 projector. The encoded patterns are projected 
onto the various objects by the projector, and the camera captures the scene. We did 
not use any optimization algorithm for extracting stripe identification labels since the 
direct color classification and decoding algorithms work well. The projected patterns 
all consist of stripes with one-pixel width, and the width of stripes in the captured 
images is around 2 pixels. We used RGB stripes with k = 7 and n1 = 192 for one-shot 
imaging and GB stripes with k = 7 and n2 = 256 for two-shot imaging, as described in 
Section 5. We used only 2 color channels because they keep n2 large enough, and 
we chose GB instead of RG because the G and B color channels have 
slightly less crosstalk than the R and G channels in our camera-projector setup. 

Fig. 8 shows the results from a human face with one-shot and two-shot imaging. 
For one-shot imaging, Figs. 8(a), (d), (b) and (f) show the subject under white 
projector illumination, stripe pattern projection, pseudo-color display of identified 
stripes from subpatterns, and the range image from the identified stripes, respectively. 
For two-shot imaging, Figs. 8(e), (c) and (g) show the two stripe patterns, pseudo-color 
display of identified stripes from subpatterns, and the range image from the identified 
stripes, respectively. Since the face has moderate colors, both methods work well. 

One-shot and two-shot imaging have also been tested with a flat color panel with 
highly saturated color patches. Figs. 9(a), (b), (c) and (d) show the color panel under 
white light, one of the color patterns for two-shot imaging, the range image from 
one-shot imaging, and the range image from two-shot imaging, respectively. It can be seen 
that the strong colors of surface reflectance confound the one-shot imaging 
significantly. Figs. 10(a), (b), (c) and (d) show the chromaticity plot from the human face, 
those from the color panel under white light and under RGB stripe light, and the plot 
of panel colors in GB space from two-shot imaging, respectively. 



Fig. 8. Results from a human face: (a) A human face under white light, (b) identified stripes 
from subpattern from one-shot imaging (pseudo color assignment for each stripe), (c) from two- 
shot imaging (pseudo color assignment for each stripe), (d) color stripes for one-shot imaging, 
(e) two stripe patterns for two shot imaging, (f) range image from one-shot imaging, and (g) 
range image from two-shot imaging 







Fig. 9. Experiments with a color panel: (a) under white light, (b) one of the color patterns for 
two-shot imaging, (c) range image from one-shot imaging, and (d) range image from two-shot 
imaging 






High-Contrast Color-Stripe Pattern for Rapid Structured-Light Range Imaging 







Fig. 10. Color plots: (a) Chromaticities from the human skin under RGB stripe light, (b) those 
from the color panel under white light, and (c) those from RGB stripe light, and (d) plot of 
panel colors in GB space from two-shot imaging 




Fig. 12. Two-shot imaging with a cylindrical object: (a) under white illumination, (b) one of the 
two stripe patterns, (c) range image with the presented method, and (d) with the method in [11] 
As can be seen from Fig. 10(a), the face colors are moderate and good for 
one-shot imaging. However, the colors from the panel are so strong that projection of 
RGB colors cannot reliably separate the colors of stripe-like regions in the captured 
image into different regions in the chromaticity space, as shown in (b) and (c), while we 
can easily identify GB-combination codes in (d). 

Fig. 11 shows (a) a strongly colored object (Rubik’s cube), (b) its stripe 
segmentation and (c) the high-resolution range result using the presented two-shot 
method. The stripes are properly segmented in spite of the strong surface colors and 
specular reflections, as is the black paper region at the bottom. Note that the 
range result is sufficiently good despite the discontinuities of the multiple-stripe 
sequence in the black lattice regions. Fig. 12 compares the result from a colored 
cylinder with the proposed two-shot method and that with the method of Caspi et al. 
[11]. It can be seen that the presented method has the advantage in the bright (or 
highly saturated) and dark regions. 



7 Conclusion 

For rapid range sensing, we described a design of multiple-stripe patterns for 
increasing the number of distinguishable stripe codes, discussed a color selection 
scheme for reducing the ambiguity in stripe labeling in one-shot imaging, and 
presented a novel method for two-shot imaging that is insensitive to object color 
reflectance, ambient light and limitations in projector/sensor resolution. We showed 
that maximizing color contrast between the stripes in one-shot imaging reduces the 
ambiguities resulting from system resolution and object colors to some degree, and 
that the new method of utilizing color differences in two-shot imaging further reduces the 
ambiguities resulting from colored object surfaces, ambient light and sensor/projector 
noise and nonlinearity. By using the signs of color differences instead of color ratios 
in two-shot imaging, we can obtain more reliable information in the bright, dark, and 
strongly-colored regions of objects and also minimize the number of shots in 
multiple-frame imaging. 
Abstract. An image sequence-based framework for appearance-based 
object recognition is proposed in this paper. Compared with the meth- 
ods of using a single view for object recognition, inter-frame consistencies 
can be exploited in a sequence-based method, so that a better recog- 
nition performance can be achieved. We use the nearest feature line 
method (NFL) [8] to model each object. The NFL method is extended in 
this paper by further integrating motion-continuity information between 
feature lines in a probabilistic framework. The associated recognition 
task is formulated as maximizing an a posteriori probability measure. 
The recognition problem is then further transformed to a shortest-path 
searching problem, and a dynamic-programming technique is used to 
solve it. 



1 Introduction 

Appearance-based methods [2] [9] [10] [12] [13] [14] [15] [19] emphasize the use of 
view-based representations of objects, which are constructed from a set of views 
of an object in a pre-processing (or learning) stage, for object recognition or 
tracking. The collection of views is usually recorded in a compact way through 
principal component analysis (PCA) [10] [12], support vector machine (SVM) 
[13] [14] or neural networks [15] [19]. In the past, Murase and Nayar [10] observed 
that all the training feature vectors (e.g., vectors in association with a PCA 
representation) of an object form a manifold in the feature space. They 
approximated the manifold by using a spline interpolation for the feature vec- 
tors of a set of sampled views. In addition, Roobaert and van Hulle [14] used 
SVM and Roth, Yang and Ahuja [15] used sparse network of winnows (SNoW) 
for modelling the sampled views. Appearance-based techniques can also be used 
for recognizing objects in a cluttered environment [12] [13] and for tracking long 
image sequences or sequences across views [2]. 

Object recognition via linear combination [16] [8] [1] [17] is an interesting and 
informative concept that has received considerable attention in recent years. In [16], 
Ullman and Basri demonstrated that the variety of views depicting the same object 

* This work was done while he was a research assistant at Institute of Information 
Science, Academia, Taipei, Taiwan. 
under different transformations can often be expressed as the linear combination 
of a small number of views, and suggested how this linear combination property 
may be used in the recognition process. In [17], Vetter and Poggio proposed a 
method based on linear object classes for synthesizing new images of an object 
from a single 2D view of the object by using corresponding feature points. In 
addition to recognizing or synthesizing views of different poses, linear combi- 
nation has also been shown very useful for visual recognition or view synthesis 
under different illumination conditions. For example, Belhumeur and Kriegman 
[1] have shown that a new image under all possible illumination conditions can 
be expressed as a linear combination of some basis images formed by a convex 
polyhedral cone in $R^n$ if the illumination model is Lambertian. 

Recently, a linear method called the nearest feature line (NFL) was proposed 
for object recognition [8] [7]. It uses the collection of lines passing through each 
pair of the feature vectors belonging to an object to model appearances of this 
object. Instead of using splines [10], the NFL method uses a linear structure to 
represent the appearance manifold, which has a close relationship with the linear- 
combination approaches mentioned above. In essence, infinite feature vectors 
of the object class can be generated from finite sample vectors with the NFL 
method. Note that NFL can be treated as an extension of the nearest-neighbor 
(NN) method, and it has been theoretically proven that the NFL method can 
achieve a lower probabilistic error than NN when the number of available feature 
points for each object class is finite and the dimension of a feature space is high 
[20]. An experimental evaluation of the NFL method in image classification and 
retrieval was given in [7], which shows that it can make efficient use of knowledge 
about multiple prototypes of a class to represent that class. 

In this paper, a framework for sequence-based object recognition is proposed 
by employing the concept of feature lines. In particular, by further considering 
inter-feature-line consistencies, our method can use several images of an object 
as the training input for building an image-sequence-based recognition system. 
More specifically, the database built in our work contains information about 
possible moves between views in the database. Therefore, it contains much more 
information than an unordered set of views, and our method would only be 
applicable to problem domains where this information is available. The main idea 
of our method is that it tries to find objects that not only match the individual 
images, but also ensure that the sequence of views in the query could match a 
similar sequence of views in the database. In other words, the input database does 
not consist of isolated example images; rather, these images are related 
to each other via motion consistency. Hence, our method would work much better 
as more images are added, which means that the performance improvement from 
adding more images would be relatively greater for this method than for other 
methods. In our framework, a recognition task is formulated as a problem of 
maximizing an a posteriori probability measure. This problem is further reduced 
to a most-probable-path searching problem in a specially designed graph, which 
can be effectively solved with dynamic programming. 

This paper presents a general framework for sequence-based object recog- 
nition, which can be used for real-world applications such as face recognition. 
For example, by incorporating our recognition method with existing face detection 
and tracking algorithms [4] [18], this framework can be used to achieve 
sequence-based face recognition for person identity verification. The remainder 
of this paper is organized as follows. In Section 2, we present a probabilistic for- 
mulation for sequence-based object recognition by employing inter-feature-line 
consistencies. Section 3 gives the main algorithm of this paper. Some experimen- 
tal results are shown and discussed in Section 4. Finally, we make conclusions in 
Section 5. 

2 Object Recognition by Using Inter-feature-Line 
Consistency 

In the following, we will first review the nearest-feature-line (NFL) method [7] [8] 
in Section 2.1. Then, we will characterize the inter-feature-line consistencies 
formulated in our work in Section 2.2. 

2.1 Approximate Manifold with Feature Lines: An Introduction 

Assume that we have M objects, and let $X^c = \{x_i^c \mid i = 1, \ldots, N_c\}$ be a set of 
$N_c$ training feature vectors belonging to object c, c = 1, 2, ..., M. A feature line 
(FL) $\overline{x_i^c x_j^c}$ ($i \neq j$) of object c is defined as a straight line passing through $x_i^c$ and 
$x_j^c$. A FL space of object c is denoted by $S^c = \{\overline{x_i^c x_j^c} \mid 1 \le i, j \le N_c, i \neq j\}$, where 
$\overline{x_i^c x_j^c} = \overline{x_j^c x_i^c}$, $1 \le i, j \le N_c$, and the number of feature lines in $S^c$, denoted by 
$K_c$, is $N_c(N_c - 1)/2$. When there are M classes in the database, M such FL spaces 
can therefore be constructed, composed of a total number of $N_{total} = \sum_{c=1}^{M} K_c$ 
FLs. Let $\Gamma = S^1 \cup S^2 \cup \ldots \cup S^M$ be the collection of all $N_{total}$ feature lines. The 
FL distance from a query q to some feature line $\overline{x_i x_j}$ ($i \neq j$) is defined as 

$$d(q, \overline{x_i x_j}) = \|q - p\|, \tag{1}$$

where $\|\cdot\|$ is the 2-norm and p is the projection point of the query q onto $\overline{x_i x_j}$. 
The projection point p can be computed as $p = x_i + \mu(x_j - x_i)$ and the position 
parameter $\mu \in R$ is 

$$\mu = \frac{(q - x_i) \cdot (x_j - x_i)}{(x_j - x_i) \cdot (x_j - x_i)}. \tag{2}$$

Figure 1(a) illustrates FLs and FL distances. Note that when $0 < \mu < 1$, p is 
an interpolating point between $x_i$ and $x_j$. Otherwise, p is an extrapolating point 
either on the $x_j$ side (when $\mu > 1$) or the $x_i$ side (when $\mu < 0$). NFL recognizes 
q as object $c^*$ by computing the minimal FL distance over all feature lines 
contained in $\Gamma$, as shown below. 

$$d(q, \overline{x_{i^*}^{c^*} x_{j^*}^{c^*}}) = \min_{1 \le c \le M} \min_{1 \le i < j \le N_c} d(q, \overline{x_i^c x_j^c}) = \min_{\overline{x_i^c x_j^c} \in \Gamma} d(q, \overline{x_i^c x_j^c}) \tag{3}$$
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Fig. 1. (a)The concepts of a FL xfxj and the FL distance from a query x to x, x , . 
(b)The concept of neighboring FLs. The feature lines drawn with dash lines are neigh- 
boring FLs of x£x£, 
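
As a concrete illustration of Eqs. (1)-(3), the following Python sketch computes the
FL distance and the single-view NFL decision. It is a minimal sketch rather than the
authors' implementation: the function names `fl_distance` and `nfl_classify` and the
dictionary layout of the training data are assumptions made for this example, and
NumPy is assumed to be available.

```python
import numpy as np


def fl_distance(q, xi, xj):
    """Return (distance, mu, p) for query q and the feature line through xi and xj."""
    q, xi, xj = (np.asarray(v, dtype=float) for v in (q, xi, xj))
    direction = xj - xi
    denom = direction @ direction
    if denom == 0.0:
        raise ValueError("xi and xj must be distinct points")
    mu = ((q - xi) @ direction) / denom      # position parameter, Eq. (2)
    p = xi + mu * direction                  # projection of q onto the line
    return float(np.linalg.norm(q - p)), float(mu), p   # distance, Eq. (1)


def nfl_classify(q, training):
    """training maps a class label to an array of shape (N_c, D).

    Returns (label, distance) of the nearest feature line, as in Eq. (3).
    """
    best_label, best_dist = None, np.inf
    for label, views in training.items():
        views = np.asarray(views, dtype=float)
        for i in range(len(views)):
            for j in range(i + 1, len(views)):
                dist, _, _ = fl_distance(q, views[i], views[j])
                if dist < best_dist:
                    best_label, best_dist = label, dist
    return best_label, best_dist
```

The exhaustive double loop visits all $N_c(N_c-1)/2$ feature lines of every class, which
is adequate for the small numbers of training views considered here.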



2.2 Inter-feature-Line Consistencies

In essence, the NFL method [8] [7] uses only the distance-from-manifold information
for object recognition. In our framework, we extend the NFL method by incorporating
inter-feature-line consistencies through the concept of neighboring FLs. Given a
feature line $\overline{x_u^c x_w^c}$, we denote by

$$\Phi_{c,u,w} = \{\overline{x_u^c x_w^c}\}
  \cup \{\overline{x_u^c x_{w_1}^c} \mid 1 \le w_1 \le N_c,\ w_1 \neq u,\ w_1 \neq w\}
  \cup \{\overline{x_{u_1}^c x_w^c} \mid 1 \le u_1 \le N_c,\ u_1 \neq w,\ u_1 \neq u\}$$

the set of its neighboring FLs, that is, the FL itself together with every FL of the
same object that shares one endpoint with it. An illustration of neighboring FLs is
shown in Figure 1(b). The number of neighboring FLs of $\overline{x_u^c x_w^c}$ is
therefore equal to $1 + 2(N_c - 2) = 2N_c - 3$.
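
The neighbor set can be enumerated directly, which also confirms the count
$2N_c - 3$. The sketch below is illustrative only: it represents a feature line of one
object by an index pair `(u, w)` with `u < w`, and the helper names `feature_lines` and
`neighbors` are assumptions of this example.

```python
from itertools import combinations


def feature_lines(n_views):
    """All feature lines of one object as unordered index pairs (u, w), u < w."""
    return list(combinations(range(n_views), 2))


def neighbors(fl, n_views):
    """Feature lines sharing at least one endpoint with fl (fl itself included)."""
    u, w = fl
    return [g for g in feature_lines(n_views) if u in g or w in g]
```

For an object with four training views, `feature_lines(4)` has 6 entries and every
feature line has 5 neighbors.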

Let $\Lambda_q = \{q_0, q_1, \ldots, q_L\}$ be a sequence of $L + 1$ views. In addition, let
$\mathcal{F} = \{l_i \in \Gamma \mid i = 0, 1, \ldots, L\}$ be a set of $L + 1$ FLs where each $q_i$
matches its projection point $p_i$ on $l_i$. We recognize $\Lambda_q$ by finding the
$\mathcal{F}^*$ that maximizes the following a posteriori probability:

$$\begin{aligned}
\mathcal{F}^* &= \arg\max_{\mathcal{F}} P(\mathcal{F} \mid \Lambda_q) \\
  &= \arg\max_{\mathcal{F}} P(l_0, \ldots, l_L \mid q_0, \ldots, q_L) \\
  &= \arg\max_{\mathcal{F}} P(q_0, \ldots, q_L \mid l_0, \ldots, l_L)\, P(l_0, \ldots, l_L) \\
  &= \arg\max_{\mathcal{F}} P(l_0)\, P(q_0 \mid l_0) \prod_{i=0}^{L-1} P(l_{i+1} \mid l_i)\, P(q_{i+1} \mid l_{i+1}),
\end{aligned} \tag{4}$$

where the last equality holds by assuming that

(i) $P(q_i \mid l_i)$, $i = 0, \ldots, L$, are independent of each other, and

(ii) $P(l_0, \ldots, l_L)$ can be modelled by a first-order Markov chain. That is,
$P(l_i \mid l_{i-1}, l_{i-2}, \ldots, l_0) = P(l_i \mid l_{i-1})$ for all $i = 1, \ldots, L$.

To evaluate (4), the transition probabilities between the feature lines $l_i$ and
$l_{i+1}$, $P(l_{i+1} \mid l_i)$, $i = 0, \ldots, L - 1$, and the likelihoods $P(q_i \mid l_i)$,
$i = 0, \ldots, L$, have to be specified. First, $P(q_i \mid l_i)$ is a probability measure
associated with the similarity between $p_i$ and $q_i$, where $p_i$ is the projection point
of $q_i$ onto $l_i$. Hence, $P(q_i \mid l_i)$ can be set to decrease with $\|q_i - p_i\|$, the
distance from the observation $q_i$ to its projection point $p_i$. We thus refer to this
probability as the probability caused by the distance from appearance manifolds
(PDAM). Second, because the image sequence to be recognized consists of
consecutive views of an object, $P(l_{i+1} \mid l_i)$ is larger when $l_{i+1}$ is a more reasonable
successor of $l_i$ in terms of motion continuity. We refer to $P(l_{i+1} \mid l_i)$ as the
probability caused by motion continuity (PMC). Note, however, that here the
concept of "motion continuity" has nothing to do with velocity or rotation; it
concerns only whether the relationship between images in the database matches
the relationship between two images in the query.



3 Probabilistic Framework and Main Algorithm 



3.1 Probability Distribution Setting for PDAM and PMC 

To evaluate (4), we make the following assumptions:

First, the PDAM is modelled as a Gaussian distribution,
$P(q_k \mid \overline{x_u^c x_w^c}) = \exp(-d_{c;u,w;k}^2 / 2\sigma^2)/Z$, where $\sigma$ is a chosen
constant, $Z$ is a normalization constant that makes $P(q_k \mid \overline{x_u^c x_w^c})$ a
probability density function, and

$$d_{c;u,w;k} = d(q_k, \overline{x_u^c x_w^c}), \tag{5}$$

for $k = 0, \ldots, L$, $c = 1, \ldots, M$, $1 \le u, w \le N_c$ and $u \neq w$.

Second, let the PMC be defined as follows:

$$P(l_{i+1} \mid l_i) =
\begin{cases}
0, & \text{if } l_{i+1} \text{ is not a neighbor of } l_i, \\
\dfrac{1}{N(l_i)}, & \text{otherwise,}
\end{cases} \tag{6}$$

where $N(l_i) = 2N_c - 3$ is the number of neighboring FLs of $l_i$.

Third, we assume equal prior probabilities for all the FLs by setting
$P(l) = 1/N_{total}$ for $l \in \Gamma$.

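Taking the natural logarithm of (4), the following formulation can be derived:

$$\begin{aligned}
\mathcal{F}^* &= \arg\max_{\mathcal{F}} \ln\Big( P(l_0)\, P(q_0 \mid l_0) \prod_{i=0}^{L-1} P(l_{i+1} \mid l_i)\, P(q_{i+1} \mid l_{i+1}) \Big) \\
  &= \arg\min_{\mathcal{F}} J(\mathcal{F}; \Lambda_q),
\end{aligned} \tag{7}$$

where

$$J(\mathcal{F}; \Lambda_q) =
\begin{cases}
\infty, & \text{if } P(l_{i+1} \mid l_i) = 0 \text{ for some } i \in \{0, \ldots, L - 1\}, \\
\displaystyle\sum_{i=0}^{L} \frac{d(q_i, l_i)^2}{2\sigma^2} + \sum_{i=0}^{L-1} \ln(2N_c - 3) + \ln N_{total}, & \text{otherwise.}
\end{cases} \tag{8}$$

Here the last two terms are the negative logarithms of the PMC in (6) and of the
prior $1/N_{total}$, and the constant $(L + 1)\ln Z$ is dropped because it does not depend
on $\mathcal{F}$.

From (7), finding the $\mathcal{F}^*$ that maximizes the a posteriori probability is equivalent
to finding the $\mathcal{F}^*$ that minimizes the objective function $J$ defined in (8). Note that
computing $\mathcal{F}^*$ by brute force is computationally intractable, since there are
$N_{total}^{L+1}$ candidate FL sequences. In this work, the PDAM and PMC are encoded in
a matching graph, and dynamic programming is adopted for solving this
minimization problem.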
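
To make the objective concrete, the sketch below evaluates the cost of one candidate
FL sequence by taking the negative logarithm of each factor of (4) under the three
assumptions above, so the transition and prior terms enter with a positive sign and
the constant $\ln Z$ terms are omitted. It is an illustration under stated assumptions:
a feature line is represented as a tuple `(c, u, w)`, and the names `is_neighbor` and
`path_cost` as well as the argument layout are hypothetical.

```python
import math


def is_neighbor(a, b):
    """FLs are (c, u, w); neighbors belong to the same object and share an endpoint."""
    return a[0] == b[0] and bool({a[1], a[2]} & {b[1], b[2]})


def path_cost(path, dists, n_views, n_total, sigma):
    """Cost J of a sequence of FLs.

    path    : list of FLs l_0, ..., l_L as (c, u, w) tuples
    dists   : dists[i] is the FL distance d(q_i, l_i)
    n_views : dict mapping object c to its number of training views N_c
    n_total : total number of feature lines in the database
    """
    for a, b in zip(path, path[1:]):
        if not is_neighbor(a, b):
            return math.inf                      # zero transition probability
    data_term = sum(d * d for d in dists) / (2.0 * sigma ** 2)
    motion_term = sum(math.log(2 * n_views[a[0]] - 3) for a in path[:-1])
    return data_term + motion_term + math.log(n_total)
```

Because every admissible sequence stays within one object, the last two terms are the
same for all sequences whenever all objects have the same number of training views.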
3.2 Construction of Matching Graph

We construct a graph $G$ containing $L + 1$ levels. There are $N_{total}$ nodes in
each level, where $N_{total}$ is the total number of feature lines defined before. For
the $k$-th level ($k = 0, \ldots, L$), the $N_{total}$ nodes are denoted by
$n_{x_u^c x_w^c; k}$, $c \in \{1, 2, \ldots, M\}$ and $u, w \in \{1, 2, \ldots, N_c\}$. In addition to
these nodes, we also construct a source node $n_{-1}$ and a sink node $n_{L+2}$. Edges are
then constructed as follows. The source node $n_{-1}$ is connected to every node at the
first level ($k = 0$), and every node at the last level ($k = L$) is connected to the
sink node $n_{L+2}$. For each pair of adjacent levels $k$ and $k + 1$, $k = 0, \ldots, L - 1$,
there is an edge $e(\overline{x_{u_1}^c x_{w_1}^c}; \overline{x_{u_2}^c x_{w_2}^c}; k)$ linking
$n_{x_{u_1}^c x_{w_1}^c; k}$ to $n_{x_{u_2}^c x_{w_2}^c; k+1}$ if
$\overline{x_{u_2}^c x_{w_2}^c} \in \Phi_{c,u_1,w_1}$, the set of neighboring FLs of
$\overline{x_{u_1}^c x_{w_1}^c}$. Figure 2 shows an illustration of the graph $G$ for the case
in which there are two objects and each object has four views.



Fig. 2. An example of a graph $G$ that is constructed for the case of two objects, where
each object contains four views. There are therefore $6 = \binom{4}{2}$ FLs for each object.

For each node $n_{x_u^c x_w^c; k}$ (except the source and sink nodes), a node score
$s_{c;u,w;k}$ is assigned to it by setting

$$s_{c;u,w;k} = d_{c;u,w;k}^2 / \sigma^2. \tag{9}$$

This node score is used to encode the log likelihood of the PDAM in (8). In
addition, the scores of the source and sink nodes, $s_{-1}$ and $s_{L+2}$, are both set to 0.

Then, the cost of a node is defined following the principle of dynamic
programming. First, the cost of the source node, $cost(n_{-1})$, is set to zero. Then,
the cost of each of the other nodes is defined recursively as

$$cost(n_{x_u^c x_w^c; k}) = s_{c;u,w;k}
  + \min\{\, cost(n_{x_{u'}^c x_{w'}^c; k'}) \mid n_{x_{u'}^c x_{w'}^c; k'} \in \Theta_{c;u,w;k} \,\}, \tag{10}$$

for $k = 0, \ldots, L$, $c = 1, \ldots, M$ and $1 \le u, w \le N_c$, and likewise for the sink
node. Note that $\Theta_{c;u,w;k}$ is the set of nodes having edges linking to
$n_{x_u^c x_w^c; k}$.

Fig. 3. Coil-100 data set. (a) The 100 objects in the Coil-100 data set. (b) Some views
of ten of the objects shown in (a) (sampled every 10°).



Each path starting from the source node and ending at the sink node represents
a sequence of matches between the views in $\Lambda_q$ and FLs in $\Gamma$. The cost of
the sink node, $cost(n_{L+2})$, is then referred to as the minimal cost, and it is easy
to verify that $cost(n_{L+2}) = \min_{\mathcal{F}} J(\mathcal{F}; \Lambda_q)$ defined in (8).
The path associated with the minimal cost or, equivalently, the shortest path
from the source node to the sink node, is referred to as the matching (or optimal)
path in this work. The FLs associated with the nodes on this path (except the source
and sink nodes) are then treated as the sequence of FLs matched to the sequence
of consecutive testing views, $\Lambda_q$, and the object represented by these FLs
serves as the recognition result.

In our work, to avoid recursive programming, the Dijkstra algorithm [5] is
used to find the optimal path. For each node, the incoming edge with the lowest
accumulated cost is retained. After the best incoming edge has been found for
every node, the process backtracks from the sink node to the source node
to obtain the optimal path. Except for the source and sink nodes, each node
on the optimal path represents a match between a view belonging
to $\Lambda_q$ and a FL belonging to $\Gamma$.
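
The search over the matching graph can be sketched as follows. This is an
illustrative level-by-level dynamic-programming pass with backtracking, not the
Dijkstra-based implementation mentioned above. It assumes that all objects have the
same number of training views, so that the transition and prior terms of (8) are
constant and only the PDAM term $d^2/2\sigma^2$ needs to be accumulated. The tuple
representation `(c, u, w)` of a feature line and the names `shares_endpoint` and
`best_path` are assumptions of this example.

```python
def shares_endpoint(a, b):
    """FLs are (c, u, w); neighbors belong to the same object and share an endpoint."""
    return a[0] == b[0] and bool({a[1], a[2]} & {b[1], b[2]})


def best_path(fls, dist, sigma=1.0):
    """Minimal-cost sequence of feature lines for a sequence of views.

    fls  : list of feature lines, each a tuple (c, u, w)
    dist : dist[k][n] is the FL distance from view q_k to fls[n]
    Returns (cost, path), where path[k] indexes the FL matched to view q_k.
    """
    n = len(fls)
    # Predecessors of node j at the previous level: its neighboring FLs.
    preds = [[m for m in range(n) if shares_endpoint(fls[m], fls[j])]
             for j in range(n)]

    def score(k, j):
        return dist[k][j] ** 2 / (2.0 * sigma ** 2)

    cost = [score(0, j) for j in range(n)]
    back = []                                   # best predecessor per level
    for k in range(1, len(dist)):
        new_cost, choice = [], []
        for j in range(n):
            m = min(preds[j], key=lambda i: cost[i])
            new_cost.append(cost[m] + score(k, j))
            choice.append(m)
        back.append(choice)
        cost = new_cost

    j = min(range(n), key=lambda i: cost[i])    # node feeding the sink
    total, path = cost[j], [j]
    for choice in reversed(back):               # backtrack towards the source
        j = choice[j]
        path.append(j)
    return total, path[::-1]
```

The recognized object is the class `c` of any feature line on the returned path, since
neighboring feature lines always belong to the same object. Each level costs
$O(N_{total} \cdot (2N_c - 3))$ operations, compared with $N_{total}^{L+1}$ sequences for brute force.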



4 Experimental Results and Discussions 

4.1 Experimental Results 

Coil-100 Object Recognition. The Coil-100 data set [11] has been widely used
as an object-recognition benchmark [10] [14] [15]. In this data set, there are 100
objects and each object has 72 different views (images) that are taken every 5°
around an axis passing through the object. Each image is a 128x128 color image
with R, G, B channels. Figures 3(a) and 3(b) show these 100 objects and some
sampled views of ten of these objects, respectively.

We follow the experimental settings in [15], which used only a limited number
of views per object for training. In our experiment, four different views per
object (0°, 90°, 180° and 270°) were used for training, as shown in Figure 4(a),
and the other 6800 (i.e., (72-4)*100 = 6800) images were used for testing. In other
words, for each object, 6 feature lines can be constructed from its 4 training
views.

Fig. 4. (a) All the training views used in the experiments. (b) The sequences of views
of the first object used in our experiment (with lengths from 1 to 17).

In our experiment, the number of consecutive views (i.e., $L + 1$) used for
testing was set to 1, 3, 5, ..., 17. Sequences of consecutive testing views of
an object are shown in Figure 4(b). Note that when the number of consecutive
testing views is 1, our method degenerates to the NFL method with only a single 
image as its input. For each number of consecutive testing views, the average 
recognition error rate over 6800 tests was computed and recorded. Note that 
the sets of training and testing views are disjoint in our experiment, and thus 
the training views were dropped out when an image sequence was sampled for 
testing. 

In the first test, the input image format was set to 16x16 color, which
is the same as that in [14], and the 16x16 image directly serves as the feature
vector. Figure 5 shows the experimental results obtained with our method. From
this figure, it can be seen that our method considerably improves on the
recognition performance of the NFL method as the number of consecutive testing
views increases. To further show that our probabilistic framework integrates
consecutive visual clues better, we also tested two additional methods, NN-voting
and NFL-voting, which simply extend the nearest-neighbor and nearest-feature-line
methods, respectively, to image sequences by majority voting (i.e., every view in
the sequence is recognized independently, and the identity receiving the maximal
number of votes is reported). As shown in Figure 5, our method outperforms both
NN-voting and NFL-voting because appearance-similarity and motion-continuity
information is appropriately exploited.
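
The voting baselines can be sketched as follows. The sketch is illustrative: the
per-view classifier is passed in as a callable (for NN-voting it would be a
nearest-neighbor classifier, for NFL-voting a nearest-feature-line classifier), and
the names `majority_vote` and `vote_over_sequence` are assumptions of this example.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label that occurs most often in the sequence."""
    return Counter(labels).most_common(1)[0][0]


def vote_over_sequence(views, classify):
    """Classify every view independently and report the majority label."""
    return majority_vote([classify(view) for view in views])
```

Unlike the path search of Section 3, this baseline ignores the order of the views,
so it cannot exploit the neighborhood constraint between consecutive matches.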

In the second test, we investigate how the recognition error rate of our
method improves as more training images are added. In this test, 32x32
gray images from the same database were used, as adopted in [15], and
the 32x32 image directly serves as the feature vector. We vary the number of
training views from 4 to 8, and Figure 6 shows the results. For example, when
the number of training views (FLs) per object increases from 4 (6) to 6 (15) and
8 (28), the lowest recognition error rate decreases from 10.38% to 3.5% and 0.67%,
respectively. These experimental results show that the performance
of the method improves dramatically as more images are added. Figure 6
also shows that our method greatly improves on the single-view-based
NFL method and is better than the NN-voting and NFL-voting methods.

Fig. 5. Object recognition results (4 training views per object and a 16x16 color
representation) using the same image format (i.e., color 16x16 images) as that in
[14]. The horizontal axis is the number of consecutive testing views. We compare our
method (our) with nearest neighbor voting (nn-voting) and nearest feature line voting
(nfl-voting) methods.

Fig. 6. Object recognition results using the same image format (i.e., gray level 32x32
images) as that in [15]. (a) 4 training views, (b) 6 training views, (c) 8 training views.



Face Recognition. We perform face recognition on a face-only database.²
There are 1280 images of 128 persons, where each person has 10 images with
distinct poses or expressions. Each image is normalized to 32x32 pixels and
directly serves as the feature vector. Figures 7(a) and (b)
show all 128 persons and all 10 views of the first 5 persons in this face database.
In the following experiments, we use 5 different views (that is, 10 FLs) per person
for training, as shown in Figure 7(c) (left part), and the other 640
(i.e., (10-5)*128 = 640) images are used for testing. In addition, the number of
consecutive testing views is sampled from {1, 3, 5}. Some examples of sequences
of three views are shown in Figure 7(c) (right part).

2 This database can be downloaded from http://smart.iis.sinica.edu.tw/html/face.html

Fig. 7. A face-only database. (a) shows all 128 persons in this database. (b) shows all
10 views of the first 5 persons shown in the first row of (a). (c) shows the training views
(left part) and some examples of sequences of three consecutive testing views of these 5
persons (right part). Note that the training views are formed by collecting
the faces in columns 1, 3, 5, 7 and 9 shown in (b).



Table 1 shows the experimental results of our method compared with NN-voting
and NFL-voting. Our method gives the lowest error rate for every sequence
length, tying with NFL-voting only in the single-view case.

Table 1. Performance comparison between our method (our) with nearest neighbor
voting (nn-voting) and nearest feature line voting (nfl-voting) methods on the
face-only database. Recognition error rates in % on different numbers of consecutive
testing views are shown. Our method gives the best performance.

Number of testing views    Ours    NN-Voting    NFL-Voting
1                          5.31    6.71         5.31
3                          0.26    2.86         2.6
5                          0       1.55         1.56



4.2 Discussions 

To the best of our knowledge, no image-sequence-based methods have previously
been tested on these databases, so some existing results are introduced
in the following for comparison.

For the Coil-100 database, the following results have been reported. When the
image format is 16x16 color and the number of training views is four, the
PCA-and-Spline-Manifold method [10] and the linear SVM method [14] achieved 12.4%
and 13.1% recognition error rates, respectively (the NFL method achieved 15.6%
in our testing). When the image format is 32x32 gray, the linear SVM method
[14] and the snow-with-image method [15] achieve 18.4% and 21.5% error rates,
respectively (the NFL method achieved 25.3% in our testing). From Figures 5
and 6(a), it can be seen that our method outperforms the existing results
shown above when only 5 and 9 consecutive images of objects are used for the
16x16 color and 32x32 gray image formats (four training views), respectively.³ For the
face-only database, a 13% recognition error rate was achieved in [3] when

their method is tested on a part of this database (100 persons) and 6 images
per person are used as training images. From Table 1, it can be seen that our
method has lower recognition error rates even when only five training views are
used. The results reveal that the recognition performance can be considerably
improved by appropriately exploiting useful visual clues contained in an image
sequence.

3 The above results were all tested based purely on image intensity information. Roth,
Yang and Ahuja [15] have further exploited edge information to achieve a better
error rate, 11.7%, in their snow-with-image-and-edge method. Our result is better
when the number of views in use is 15, without using edge information, as shown in
Figure 5 (four training views).

Some further remarks are addressed below:

Remark 1 [Neighborhood Relationships]: Although each of our experiments
uses a database with a single linear sequence of views as the object rotates
about a single axis, other kinds of neighborhood relationships in the database
could be exploited. For example, our approach can also be used when
a database contains views of the object rotated about two axes [6]. In this
case, the neighborhood relationship is two-dimensional rather than
one-dimensional, and our approach is still applicable.

Remark 2 [Feature Selection]: Although raw data was directly used as features
in our experiments, this is not the only choice. Our method can also use features
produced by feature-extraction or feature-generation processes such as principal
component analysis or linear discriminant analysis, and features generated in
such ways may improve computational efficiency or recognition accuracy.



5 Conclusions

There are several characteristics of our framework for appearance-based object
recognition using a sequence of views. First, we emphasize inter-feature-line
consistencies. Second, we take both the probability caused by the distance from
manifold, PDAM, and the probability caused by motion continuity, PMC, into
consideration. Third, to handle the associated recognition problem, we construct
a matching graph in which PDAM and PMC are incorporated, and transform
this problem into a shortest path problem that can be effectively solved by using
dynamic programming. The experimental results on the Coil-100 data set and
the face-only database show that our method achieves high recognition rates for
object recognition and face recognition. Our method thus provides an effective
way for appearance-based object recognition using a sequence of views.
Abstract. Previous manifold learning algorithms mainly focus on uncovering
the low dimensional geometric structure from a set of samples that lie on or
nearly on a manifold in an unsupervised manner. However, the representations
from unsupervised learning are not always optimal in discriminating capability.
In this paper, a novel algorithm is introduced to conduct discriminant analysis
in terms of the embedded manifold structure. We propose a novel clustering
algorithm, called Intra-Cluster Balanced K-Means (ICBKM), which ensures
that there are balanced samples for the classes in a cluster; and the local
discriminative features for all clusters are simultaneously calculated by
following the global Fisher criterion. Compared to the traditional linear/kernel
discriminant analysis algorithms, ours has the following characteristics: 1) it is
approximately a locally linear yet globally nonlinear discriminant analyzer; 2) it
can be considered a special Kernel-DA with a geometry-adaptive kernel, in
contrast to traditional KDA whose kernel is independent of the samples; and 3)
its computation and memory costs are greatly reduced compared to
traditional KDA, especially for cases with a large number of samples. It does
not need to store the original samples for computing the low dimensional
representation for new data. The evaluation on a toy problem shows that it is
effective in deriving discriminative representations for problems with a
nonlinear classification hyperplane. When applied to the face recognition
problem on the YALE and PIE databases, the proposed algorithm significantly
outperforms LDA and Mixture LDA, and has better accuracy than Kernel-DA with
a Gaussian kernel.



1 Introduction 

Previous works on manifold learning [2] [6] [7] [9] focus on uncovering the compact,
low dimensional representations of the observed high dimensional unorganized data
that lie on or nearly on a manifold in an unsupervised manner. These algorithms can
be divided into two classes: 1) algorithms with a mapping function only for the sample
data, in which the sample points are represented in a low dimensional space by
preserving the local or global properties of a manifold, like ISOMAP [16], LLE [12]
and Laplacian Eigenmap [1]; 2) algorithms with a mapping function for the whole data
space. Roweis [13] presented an algorithm that automatically aligns a mixture of local
dimensionality reducers into a single global representation of the data throughout
space; Brand [3] presented a similar work to merge local representations and
construct a global nonlinear mapping function for the whole data space. He
[8] proposed the simple locality preserving projections to approximate the Laplacian
Eigenmap algorithm. All these algorithms are unsupervised and most of them are only
evaluated on toy problems.

In this paper, we propose an algorithm that utilizes the class information for
discriminant analysis in terms of the manifold structure, with applications in general
classification problems such as face recognition. It is motivated by the following
observations. First, previous works on manifold learning focus on exploring the
optimal low dimensional representations that best preserve some characteristics of a
manifold, while the best representative features are not always the best discriminant
features for a general classification task. Moreover, meaningful information
may be lost in the dimensionality reduction, which in turn degrades the subsequent
discriminant analysis based on the low dimensional data. Second, Linear
Discriminant Analysis can only handle linear classification problems, and Kernel
Discriminant Analysis [10] suffers from heavy computation and memory costs
although it can handle nonlinear cases in principle. The proposed algorithm is an
efficient discriminant analysis method based on the manifold structure, with low
time and memory costs.

For a curved manifold, the globally linearly inseparable manifold may be easily
separable locally. The intuition of this work is to place some local Linear
Discriminant Analyzers on a curved manifold, then merge these local analyzers into a
global discriminant analyzer via the global Fisher criterion. In the first step,
traditional methods such as Mixture Factor Analysis (MFA) [1] cannot be directly
applied, since they cannot guarantee that there are balanced samples for the classes in
a cluster, and it is impossible to conduct local discriminant analysis with only one
class of samples in a cluster. In this work, we formulate this task as a special
clustering problem and propose a novel clustering approach, called Intra-Cluster
Balanced K-Means (ICBKM), to ensure that there are balanced samples for the
classes in a cluster.

Taking advantage of the clustering results of ICBKM, the sample data are grouped
into clusters, and local discriminant analysis can be conducted in each cluster. The
traditional way to classify new data using these local analyzers is to conduct the
classification using the nearest local analyzer. In this work, the local analyzers are
dependent in both the learning and inferring stages, and the optimal discriminative
features for each cluster are computed simultaneously. First, PCA is conducted in each
cluster; then the posterior probability of each cluster for given data, i.e. p(c|x), can
be obtained. The optimal discriminative features for each cluster are computed by
maximizing the global Fisher criterion, i.e. maximizing the ratio of the weighted
global inter- and intra-class scatters, where the scatters are computed based on
the p(c|x)-weighted representations of the samples. In the inferring stage, the low
dimensional representation for new data is derived as the p(c|x)-weighted sum of the
projections from different clusters, and the classification can be conducted using the
Nearest Neighbor (NN) algorithm based on the low dimensional representations. This
algorithm can be justified from two different perspectives: 1) it automatically merges
the local linear discriminant analyzers; and 2) it can be considered a special kernel
discriminant analysis algorithm with a geometry-adaptive kernel, in contrast to
traditional KDA whose kernel is independent of the samples.
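
The inferring stage described above can be sketched as follows. This is only an
illustration under stated assumptions: the posteriors p(c|x), the per-cluster
projection matrices and the cluster means are taken as given inputs, centering each
sample on the cluster mean before projecting is an assumption of this example, and
the names `embed` and `nearest_neighbor_label` are hypothetical. NumPy is assumed
to be available.

```python
import numpy as np


def embed(x, posteriors, means, projections):
    """Low dimensional representation y = sum_c p(c|x) * W_c^T (x - m_c).

    posteriors  : sequence of p(c|x), one value per cluster
    means       : sequence of cluster means m_c, each of shape (D,)
    projections : sequence of matrices W_c, each of shape (D, d)
    """
    x = np.asarray(x, dtype=float)
    return sum(p * (np.asarray(W).T @ (x - np.asarray(m)))
               for p, m, W in zip(posteriors, means, projections))


def nearest_neighbor_label(y, train_embeddings, train_labels):
    """Label of the training embedding closest to y in Euclidean distance."""
    distances = np.linalg.norm(np.asarray(train_embeddings) - y, axis=1)
    return train_labels[int(np.argmin(distances))]
```

Only the cluster parameters and the embedded training samples are needed at this
stage, which is consistent with the claim that the original samples need not be
stored to embed new data.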
The rest of the paper is structured as follows. The intra-cluster balanced K-Means
clustering method and global discriminant analysis based on the clustering results are
introduced in Section 2. In Section 3, we present our justifications for the proposed
algorithm from two different perspectives. The toy problem and the face recognition
experiments compared with traditional LDA and KDA on the YALE and PIE databases
are presented in Section 4. Finally, we give concluding remarks in Section 5.



2 Discriminant Analysis on Embedded Manifold 

Suppose $X = \{x_1, x_2, \ldots, x_N\}$ is a set of sample points that lie on or nearly on a low
dimensional manifold embedded in a high dimensional observed space. For each
sample $x_i \in \mathbb{R}^d$, a class label is given as $l_i \in \{1, 2, \ldots, L\}$. Previous works on
manifold learning are unsupervised and mainly focus on finding the optimal low
dimensional embedding, i.e. the best low dimensional representations that preserve
some characteristics of a manifold. However, the class information is not efficiently
utilized in these algorithms and the derived representations are not always optimal
for general classification tasks.

Here we show how to utilize the class information to conduct nonlinear
discriminant analysis in terms of the manifold structure. A continuous manifold can be
considered as the union of a series of open sets, and the discrete sample
data on it can be considered as the combination of a series of clusters. Moreover,
a globally linearly inseparable manifold may be easily separable on
these local open sets. This motivates us to conduct local discriminant analysis in these
local clusters, and then merge these local analyzers into a global discriminant
analyzer. Following this idea, we first segment the sample data into clusters.
Traditional clustering algorithms like K-means [11] and Normalized Cut [14] cannot
be applied to the problem considered here, since there may be only a single class in
some clusters, which makes local discriminant analysis impossible. To address
this, we propose a clustering algorithm called Intra-Cluster Balanced K-Means
to ensure that the sample numbers for the classes in a cluster are balanced. Second,
we search for the locally optimal features in each cluster by following the global
Fisher criterion, in which we maximize the ratio of the cluster-weighted inter- and
intra-class scatters. In the following subsections, we introduce the two steps of our
algorithm in detail.



2.1 Intra-cluster Balanced K-Means Clustering 

The K-means clustering algorithm aims at putting similar samples in the same
cluster. It is unsupervised, so it cannot guarantee that there is a balanced number of
samples for the classes in a cluster. Compared to general clustering problems, the
clustering problem considered here has some special characteristics: 1) the class label
of each sample is available, so the clustering can be supervised; 2) its purpose is not
only to put similar samples in the same cluster, but also to ensure that the samples for
the classes in each cluster are balanced, since local discriminant analysis will be
conducted in each cluster. Cheung [4] proposed a variation of K-Means called
Cluster Balanced K-Means (CBKM), in which the concept of cluster balance was
proposed. However, it only ensures that the sample number in each cluster is balanced
and does not take the class label information into account. To provide a solution to
this special clustering problem, we propose a novel clustering approach named
Intra-Cluster Balanced K-Means (ICBKM). ICBKM satisfies the requirement that there
are balanced samples for the classes in each cluster by adding an extra regularization
term to constrain the sample number variation for the classes in each cluster.

ICBKM: Given the class label set $S_l = \{1, 2, \ldots, L\}$, the data set $X$, the class label
$l_i$ for each sample $x_i$ in $X$, and the cluster number $K$.

1. Initialization: Compute the standard deviation $\delta$ of the data set $X$; randomly
select $\bar{x}^1, \bar{x}^2, \ldots, \bar{x}^K$ as the initial cluster centers, then assign each $x_i$ to the
cluster whose center is nearest to $x_i$;

2. Reset Cluster Centers: For each cluster $C_k$, reset the center as the average of
all the samples assigned to cluster $C_k$;

3. Assignment Optimization: For each $x_i \in X$, assign it to the cluster that makes
the objective function smallest such that the result satisfies the constraint in (1);

4. Exchange Optimization: For each cluster, exchange the cluster labels of the
sample in $C_k$ that is farthest from the cluster center and the sample not in $C_k$ that
is nearest to the cluster center. If there is no improvement, keep the previous labels;

5. Evaluation: If the current step yields no improvement, return the final clustering
results $\{C_1, C_2, \ldots, C_K\}$; else, go to step 2.

Fig. 1. Intra-Cluster Balanced K-Means Algorithm

Formally, the objective function of ICBKM can be represented as:

$$\arg\min \sum_{k=1}^{K} \sum_{x_i \in C_k} \frac{\|x_i - \bar{x}^k\|^2}{\delta^2}
  + \alpha \sum_{k=1}^{K} (N^k - \bar{N})^2
  + \beta \sum_{k=1}^{K} \sum_{c=1}^{c_k} (N_c^k - \bar{N}^k)^2 \tag{1}$$

subject to: $c_k \ge 2$ ($k = 1, 2, \ldots, K$),

where $\bar{x}^k$ is the average of the samples in cluster $k$; $\delta$ is the standard deviation
of the data set $X$; $N^k$ is the sample number in cluster $k$; $\bar{N}$ is the average sample
number for each cluster; $N_c^k$ is the sample number of the $c$-th class in cluster $k$;
$\bar{N}^k$ is the average sample number for each class in cluster $k$; $c_k$ is the class
number in cluster $k$; and $\alpha$ and $\beta$ are the weighting coefficients
for the last two terms.
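
To illustrate how the three terms interact, the following sketch evaluates an
objective of this form for a given cluster assignment. It is a sketch under stated
assumptions rather than the authors' code: the compactness term is assumed to be
normalized by the squared standard deviation of the data, a violated constraint is
reported as an infinite cost, and the function name `icbkm_objective` and its
arguments are hypothetical. NumPy is assumed to be available.

```python
import numpy as np


def icbkm_objective(X, class_labels, cluster_labels, n_clusters,
                    alpha, beta, delta=None):
    """Value of the balanced-clustering objective for one assignment.

    Returns np.inf if some cluster contains fewer than two classes.
    """
    X = np.asarray(X, dtype=float)
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    if delta is None:
        delta = X.std()                         # assumed normalizer
    sizes = np.array([(cluster_labels == k).sum() for k in range(n_clusters)],
                     dtype=float)

    compactness, class_balance = 0.0, 0.0
    for k in range(n_clusters):
        members = cluster_labels == k
        _, counts = np.unique(class_labels[members], return_counts=True)
        if len(counts) < 2:                     # constraint: at least two classes
            return np.inf
        center = X[members].mean(axis=0)
        compactness += ((X[members] - center) ** 2).sum() / delta ** 2
        class_balance += ((counts - counts.mean()) ** 2).sum()

    size_balance = ((sizes - sizes.mean()) ** 2).sum()
    return compactness + alpha * size_balance + beta * class_balance
```

With `alpha = beta = 0` and the constraint ignored, the value reduces to the usual
K-Means distortion scaled by `delta ** 2`.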
Fig. 2. Toy problem on synthesized data ($\alpha = 0.15$, $\beta = 0.1$). Panels: original
data, K-Means clustering, Cluster Balanced K-Means clustering, and ICBKM clustering.



In the objective function, minimizing the third term ensures that the classes
have similar numbers of samples in a given cluster; the first two terms are the same
as in CBKM. The objective function is not trivial and a closed-form solution cannot be
obtained directly. Here, we apply an iterative approach, as traditional K-Means
does. The pseudo-code is listed in Figure 1. We present a new optimization step in
ICBKM, called Exchange Optimization, in which the first term is optimized
while the last two terms are kept constant; it thereby avoids the conflict between the
first term and the last two terms that the assignment optimization step may face.
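
To illustrate how the three terms of objective (1) combine, the following minimal sketch evaluates the objective for one fixed assignment. The function name `icbkm_objective` and its arguments are assumptions made for this example and do not come from the paper; the third term is summed over the classes that actually occur in each cluster.

```python
from collections import Counter

def icbkm_objective(samples, cluster_ids, class_ids, num_clusters, alpha, beta):
    """Evaluate the three-term ICBKM objective (1) for one fixed assignment.

    samples: list of equal-length feature vectors (lists of floats)
    cluster_ids / class_ids: per-sample cluster index and class label
    (hypothetical helper; all names are illustrative only)
    """
    mean_cluster_size = len(samples) / num_clusters  # N-bar
    dim = len(samples[0])
    scatter = size_penalty = class_penalty = 0.0
    for k in range(num_clusters):
        members = [i for i, c in enumerate(cluster_ids) if c == k]
        # Second term: deviation of the cluster size from the average size.
        size_penalty += (len(members) - mean_cluster_size) ** 2
        if not members:
            continue
        center = [sum(samples[i][d] for i in members) / len(members)
                  for d in range(dim)]
        # First term: squared distances to the cluster mean.
        scatter += sum((samples[i][d] - center[d]) ** 2
                       for i in members for d in range(dim))
        # Third term: spread of per-class counts around their mean in cluster k.
        counts = Counter(class_ids[i] for i in members).values()
        mean_count = len(members) / len(counts)
        class_penalty += sum((n - mean_count) ** 2 for n in counts)
    return scatter + alpha * size_penalty + beta * class_penalty
```

The sketch does not check the constraint $c_k \geq 2$; a full implementation would have to reject assignments that leave a cluster with a single class.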

Comparison experiments on the synthesized data were conducted, and Figure 2
shows the results. ICBKM produces intra-cluster balanced clustering results,
whereas some clusters in the results of K-Means and CBKM contain only one class,
which makes local discriminant analysis impossible.



2.2 Global Discriminant Analysis by Merging Local Analyzers 

Taking advantage of the proposed Intra-Cluster Balanced K-Means approach, the
sample data are segmented into clusters with balanced samples for the different classes. The
traditional way to utilize these clustering results is to conduct discriminant analysis in
each cluster, then determine the class label of new data according to its nearest
discriminant analyzer. In this way, the local analyzers are independent and the final
classification uses only part of the available information. We propose to utilize the
global Fisher criterion to combine the local discriminant analyzers into a globally
nonlinear discriminant analyzer. The global Fisher criterion maximizes the ratio of the
weighted inter- and intra-cluster scatters. The entire algorithm has three steps, which
are introduced in detail as follows:

PCA Projections: In each cluster, Principal Components Analysis (PCA) is
conducted for dimensionality reduction; moreover, as in Fisher-faces, the PCA step
prevents the algorithm from suffering from the singularity problem when the number of
samples is less than the number of features. In all our experiments, we retain 98% of the
energy in terms of reconstruction error. Thus, in each cluster, each data point $x_i \in X$ can be
represented as a low-dimensional feature vector:

$$z_i^k = (W_{pca}^k)^T (x_i - \bar{x}_k), \quad k = 1, \ldots, K \qquad (2)$$

in which $W_{pca}^k$ contains the leading eigenvectors. The conditional probability for cluster $k$
given the data $x$, $p(C_k \mid x)$, can be obtained using a simple formulation [13]:

$$p(C_k \mid x) = p_k(x) \Big/ \sum_{j=1}^{K} p_j(x) \qquad (3)$$

where $p_k(x) = \exp\{-\alpha_k(x)\}$ and $\alpha_k(x)$ is the activity signal of the data for cluster
$k$. In our experiments, $\alpha_k(x)$ is set as the Mahalanobis distance of the data in the
PCA space of cluster $k$.
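
As a small illustration of Eqn (3), the sketch below converts a list of activity signals into cluster posteriors. The function name `cluster_posteriors` is hypothetical, and the shift by the smallest activity is a numerical safeguard added for this example (it leaves the ratio in Eqn (3) unchanged); it is not described in the paper.

```python
import math

def cluster_posteriors(activities):
    """Turn activity signals alpha_k(x) into p(C_k | x) as in Eqn (3)."""
    # Subtracting the smallest activity leaves the ratios unchanged and
    # avoids underflow when all activities are large.
    shift = min(activities)
    weights = [math.exp(-(a - shift)) for a in activities]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, two clusters with equal Mahalanobis distances receive posterior 0.5 each, regardless of how large the common distance is.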

Nonlinear Dimensionality Reduction by Following Global Fisher Criterion: As
previously mentioned, linear discriminant analysis cannot handle nonlinear
classification problems, and KDA suffers from heavy computation and memory
cost in the classification stage. Here, we propose a novel discriminant analysis
algorithm that conducts nonlinear discriminant analysis without needing to store samples
for the feature extraction of new data. The optimal features for all the clusters are
simultaneously computed in closed form, and the local discriminant analyzers are
automatically merged into a globally nonlinear discriminant analyzer by following the
global Fisher criterion. Each sample $x_i \in X$ can be represented in cluster $k$ as a
low-dimensional vector $z_i^k$. The purpose of the algorithm is to find the feature
directions $w_f^k$ and the translations $w_0^k$ for each cluster that optimize the global Fisher
criterion. Let $w^k = ((w_f^k)^T, (w_0^k)^T)^T$; the optimal representation for $x_i$ is a
weighted sum of the projections from the different clusters:

$$F(x_i) = \sum_{k=1}^{K} p(C_k \mid x_i) \left( (W_{pca}^k)^T (x_i - \bar{x}_k) \cdot w_f^k + w_0^k \right) = \sum_{k=1}^{K} p_i^k \left( z_i^k \cdot w_f^k + w_0^k \right) = z_i \cdot W \qquad (4)$$

where $p_i^k = p(C_k \mid x_i)$, $W^T = ((w^1)^T, (w^2)^T, \ldots, (w^K)^T)$ and $z_i^T = ((p_i^1 z_i^1)^T, p_i^1, \ldots, (p_i^K z_i^K)^T, p_i^K)$.
The global intra-class and inter-class scatters can then be represented as:

$$S_w = \sum_{i=1}^{N} (F(x_i) - \bar{F}^{l_i})(F(x_i) - \bar{F}^{l_i})^T = W^T \sum_{i=1}^{N} (z_i - \bar{z}^{l_i})(z_i - \bar{z}^{l_i})^T W = W^T M_w W \qquad (5)$$

$$S_b = \sum_{l=1}^{L} N_l (\bar{F}^{l} - \bar{F})(\bar{F}^{l} - \bar{F})^T = W^T \sum_{l=1}^{L} N_l (\bar{z}^{l} - \bar{z})(\bar{z}^{l} - \bar{z})^T W = W^T M_b W \qquad (6)$$

where $\bar{F}^l$ is the mean of $F(x)$ over the samples belonging to class $l$, $l_i$ is the class of $x_i$,
and $\bar{F} = \frac{1}{N} \sum_{i=1}^{N} F(x_i)$. The global Fisher criterion is to maximize the
cluster-weighted inter-class scatter while minimizing the cluster-weighted intra-class scatter, i.e.

$$W^* = \arg\max_W \frac{W^T M_b W}{W^T M_w W} \qquad (7)$$

It has a closed-form solution and can be computed directly using the generalized
eigen-decomposition algorithm [5].
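
One possible way to solve Eqn (7) numerically is sketched below. It reduces the generalized symmetric eigenproblem $M_b w = \lambda M_w w$ to a standard one through a Cholesky factor of $M_w$. The function name `fisher_directions` is hypothetical, and the sketch assumes that $M_w$ is symmetric positive definite; the paper only states that a generalized eigen-decomposition is used.

```python
import numpy as np

def fisher_directions(M_b, M_w, n_dirs=1):
    """Return the top generalized eigenvalues and directions of (M_b, M_w).

    Assumes M_b is symmetric and M_w is symmetric positive definite.
    """
    L = np.linalg.cholesky(M_w)           # M_w = L L^T
    L_inv = np.linalg.inv(L)
    A = L_inv @ M_b @ L_inv.T             # equivalent standard symmetric problem
    vals, vecs = np.linalg.eigh(A)        # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_dirs]
    W = L_inv.T @ vecs[:, order]          # map back: w = L^{-T} u
    return vals[order], W
```

Each returned column $w$ satisfies $M_b w = \lambda M_w w$, so the largest $\lambda$ is the largest attainable value of the ratio in Eqn (7) for a single direction.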



Nonlinear Dimensionality Reduction for Classification: For a new data point, the
posterior probabilities for each cluster can be computed according to Eqn (3), and its
low-dimensional representation is obtained via the following nonlinear mapping
function in terms of the derived local features in each cluster:

$$M(x) = \sum_{k=1}^{K} p(C_k \mid x) \left( (W_{pca}^k)^T (x - \bar{x}_k) \cdot w_f^k + w_0^k \right) \qquad (8)$$
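
The mapping in Eqn (8) can be sketched as follows. The function `daemon_map` and the way its arguments are laid out (one PCA basis, mean, direction matrix and offset per cluster) are assumptions made for this illustration, not the authors' implementation.

```python
import numpy as np

def daemon_map(x, posteriors, pca_bases, cluster_means, feature_dirs, offsets):
    """Evaluate the mapping of Eqn (8) for one sample x.

    posteriors[k]    : p(C_k | x), e.g. computed from Eqn (3)
    pca_bases[k]     : (D, d_k) matrix W_pca^k
    cluster_means[k] : (D,) mean of cluster k
    feature_dirs[k]  : (d_k, m) matrix holding the directions w_f^k
    offsets[k]       : (m,) translation w_0^k
    """
    x = np.asarray(x, dtype=float)
    out = 0.0
    for p, W_pca, mean, w_f, w_0 in zip(posteriors, pca_bases, cluster_means,
                                        feature_dirs, offsets):
        z = W_pca.T @ (x - mean)           # local PCA coordinates, Eqn (2)
        out = out + p * (w_f.T @ z + w_0)  # posterior-weighted local projection
    return out
```

Only the per-cluster parameters are needed at test time; no training samples have to be stored.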

It is an explicit nonlinear mapping function from the data space to the
low-dimensional space. The subsequent classification can be conducted on these
low-dimensional representations using traditional approaches such as Nearest
Neighbor (NN) or Nearest Feature Line (NFL). In all our experiments, we used
NN for the final classification.



3 Justifications

Our proposed algorithm for discriminant analysis on embedded manifold (Daemon)
consists of two steps: 1) separate the samples into class-balanced clusters; and 2)
merge the local discriminant analyzers into a global nonlinear discriminant analyzer
by following the global Fisher criterion. It supervises the local analyzers and
automatically decides the responsibility of each analyzer, somewhat like the
background process called a daemon in UNIX systems; the algorithm is therefore
referred to as Daemon in the following. The intuition behind Daemon is to merge the local
discriminant analyzers into a unified framework. It can also be understood from a
different perspective: it is a special kind of Kernel Discriminant Analysis algorithm
in which the kernel is data adaptive and geometry dependent, unlike other kernel
machines, whose kernels are independent of the samples they analyze. In the following,
we discuss these two points in detail.

Automatically Merging Local Discriminant Analyzers: In the first step, Daemon
uses ICBKM for clustering. ICBKM ensures that each derived cluster has balanced
samples for the classes, so local discriminant analysis can be conducted in each
cluster. Daemon merges these local discriminant analyzers by following the global Fisher
criterion. The locally optimal directions in the clusters are dependent and computed
simultaneously, which differs from the traditional way of utilizing clustering results,
in which the local analyzers are independent. In the classification stage, these local
analyzers are also dependent, and the final representation is a weighted sum of
their outputs. Eqn (8) can also be presented as:

$$M(x) = \sum_{k=1}^{K} p(C_k \mid x) \, FE_k(x) \qquad (9)$$

in which $FE_k(x)$ is the feature extractor in cluster $k$. As shown in Fig. 3, each local
analyzer has a different optimal feature direction and is locally discriminative;
moreover, the local analyzers can be merged into a globally nonlinear discriminant
analyzer.

Special Kernel Discriminant Analysis: Daemon follows the global Fisher
criterion in the learning stage and is intrinsically a discriminant analysis algorithm; on
the other hand, it is a special kind of Kernel Discriminant Analysis algorithm with a
geometry-adaptive kernel. The kernel of a traditional kernel machine is manually defined and
independent of the sample data. As shown in Eqn (4), Daemon can be considered a
process in which the training data are mapped into another data space $\{z\}$, and then
LDA is conducted in the new feature space. Therefore, Daemon can be considered a
special Kernel Discriminant Analysis algorithm whose kernel is:

$$k(x, y) = \phi(x) \cdot \phi(y) \qquad (10)$$

where $\phi(x) = z(x) = ((p^1 z^1)^T, p^1, \ldots, (p^K z^K)^T, p^K)^T$, in which $z(x)$ is defined as in
Eqn (4). The kernel has the following characteristics: 1) it has an explicit mapping
function from the input space to another feature space, as the polynomial kernel does;
and 2) the kernel depends on the training samples and adapts to the geometric
structure. It can be solved just like a traditional Kernel Discriminant Analysis
algorithm. Let the sample matrix be $X_k = (\phi(x_1), \phi(x_2), \ldots, \phi(x_N))$ and the optimal
feature direction be $\varphi = X_k v$; the problem then becomes:



$$v^* = \arg\max_v \frac{v^T K M K v}{v^T K N K v} \qquad (11)$$

where $K_{ij} = k(x_i, x_j)$, $M = \sum_{l=1}^{L} P_l - P$, $N = I - \sum_{l=1}^{L} P_l$; and
$P_l = \frac{1}{N_l} e_l e_l^T$, $P = \frac{1}{N} e e^T$, $e$ is a vector of ones, and $e_l(i) = \delta_{l, l_i}$.

It can be solved using the generalized eigen-decomposition algorithm, and the projection
of a new data point onto the discriminant direction is:

$$M_k(x) = \langle \varphi, \phi(x) \rangle = \sum_{i=1}^{N} v_i \, k(x_i, x) \qquad (12)$$

It is clear that $X_k v = W$, since Eqn (7) is equal to Eqn (11) when $W$ is replaced
by $X_k v$. Consequently, $M(x)$ in Eqn (8) and $M_k(x)$ in Eqn (12) are also equal. In
other words, Daemon is a Kernel Discriminant Analysis algorithm with an explicit
nonlinear mapping function from the input space to another feature space; moreover,
as it is designed to adapt to the particular geometric structure of the data, it should
cope well with nonlinearly distributed data.



4 Experiments 

In this section, we present two types of experiments to evaluate the Daemon
algorithm. The experiment on the toy problem with synthesized data demonstrates the
effectiveness of ICBKM in deriving discriminative features for a nonlinear classification
problem, and the face recognition results on the YALE and CMU PIE databases show
that Daemon significantly outperforms LDA and has slightly better accuracy than
traditional KDA.



4.1 Toy Problem 

As shown in the upper-left image of Figure 2, the original data are composed of two
classes of samples that cannot be separated linearly. They are synthesized
according to the following function:

$$\begin{cases} x_k = 0.03 \, i + \delta \\ y_k = \sin(k x_k) + k + \delta \end{cases} \qquad k = 0, 1, \quad \delta \sim N(0, 0.1) \qquad (13)$$






S. Yan et al. 



Fig. 3. The derived local feature directions using Daemon

We have systematically compared the clustering results of three K-Means-like
algorithms. The original K-Means algorithm produces clusters that minimize the
sum of intra-cluster variances. As shown in the upper-right image of Figure 2, the
sample numbers for the classes in a cluster are not balanced, and some clusters contain
only one class of samples. This makes the subsequent local discriminant analysis
impossible in these clusters. The cluster-balanced K-Means algorithm produces a
result similar to that of the original K-Means algorithm, except that the number of
samples in each cluster is balanced.

As shown in the lower-right image of Figure 2, the clustering result from our
proposed ICBKM algorithm has the following properties: 1) the sample numbers for
the classes in a cluster are balanced, which facilitates the local discriminant analysis
in each cluster; and 2) the two classes of samples in each cluster can be easily
separated. Our proposed ICBKM algorithm thus produces a more useful
clustering result than the other two methods, and the intra-cluster balanced clustering
result provides a proper structural representation for the subsequent analysis. The
computed local feature direction in each cluster is illustrated in Figure 3. It shows that
the local feature direction is approximately optimal for the samples in a cluster.



4.2 Face Recognition 

In this subsection, the YALE [17] and PIE [15] databases are used for face
recognition experiments. In both experiments, the face image is normalized by fixing
the eyes in the same position, and each pose uses a different position. The Yale face
database was constructed at the Yale Center for Computational Vision and Control. It
contains 165 grayscale images of 15 individuals. For each individual, six faces are
used for training, and the other five are used for testing. Table 1 shows the face
recognition results of Daemon, LDA, KDA, and Mixture LDA, which trains a
different LDA model for each cluster from ICBKM. It shows that Daemon significantly
outperforms LDA and Mixture LDA, and has better results than traditional KDA with
a Gaussian kernel.

Table 1. Comparison between LDAs, Kernel-DA and Daemon on Yale database

Algorithm    Fisher-faces    Kernel-DA    Mixture-LDA    Daemon (K=2)
Accuracy     80%             84%          82.7%          88%

We have also conducted multi-view face recognition on the PIE database. We
used the face images of poses 02, 37, 05, 27, 29 and 11, with out-of-plane view variation
from -45° to 45°, in our experiments. We averaged the results over 10 random splits.
The experimental results in Table 2 again show that Daemon outperforms
the other two algorithms. This demonstrates again that Daemon has a strong capability
to handle nonlinear classification problems and can improve the accuracy in
general classification problems compared with LDA.

Table 2. Comparison between Fisher-faces, Kernel-DA and Daemon on PIE database

Algorithm            Fisher-faces    Kernel-DA    Daemon (K=5)
Accuracy (67 Dim)    63.63%          67.79%       71.12%



5 Discussions and Future Directions 

We have presented a novel algorithm called Daemon for general nonlinear
classification problems. Daemon is a nonlinear discriminant analysis algorithm formulated
in terms of the embedded manifold structure. In this work, the discrete sample data on a
manifold are clustered by our proposed Intra-Cluster Balanced K-Means algorithm
such that the sample numbers for the classes in a cluster are balanced; the
locally optimal discriminant features are then derived simultaneously by following the
global Fisher criterion, which is solved via the generalized eigen-decomposition algorithm.
Daemon can be justified as an automatic merger of the local discriminant analyzers by
following the global Fisher criterion; it can also be justified as a special kernel
discriminant analysis algorithm with a geometry-adaptive kernel.

To the best of our knowledge, this is the first work to conduct discriminant analysis
while explicitly considering the embedded geometric structure. In this work, we have
only utilized the basic property that a manifold can be covered by a series
of open sets; how to combine the other topological properties of a manifold with
discriminant analysis for general classification problems is the future direction of our
work, and we are considering it in both theory and applications.



Abstract. We propose an efficient alignment method for textured Doo-Sabin
subdivision surface templates. A variation of the inverse compositional
image alignment is derived by introducing smooth adjustments
in the parametric space of the surface and relating them to the control
point increments. The convergence properties of the proposed method
are improved by a coarse-to-fine multiscale matching. The method is
applied to real-time tracking of specially marked surfaces from a single
camera view.



1 Introduction 

Real-time tracking of textured surfaces in video is important for applications
in user tracking and shape acquisition. In both of these applications, active
appearance models (AAMs) have been widely employed. A template
matching procedure is often used as a basic component of AAMs. In order to
accommodate the geometry of the tracked object, the template is formulated as
a textured shape. Both polygonal meshes and smooth splines have been used as the
underlying surface representations [1] [2]. In this paper we develop a template
tracking method for subdivision surfaces. Subdivision surfaces [3] offer a smooth
and general representation of shape that is widely used in graphics and animation.

Inverse compositional template matching was recently proposed for the ac- 
tive appearance models based on triangular meshes [1] [4] : it separately applies 
the current and incremental warps to the image and the template; the result is 
an efficient update procedure. For surface templates the separation of the incre- 
mental and current warps means that the incremental warp is performed in the 
parametric space of the surface. Thus, an additional difficulty arises on how to 
construct a space of such parametric warps. We propose a systematic approach 
to the construction of smooth atomic warps in the parameter space. Our deriva- 
tions are first done for the ideal case of smooth warps, and then approximated 
in the space of subdivision surfaces. 

The real-time operation of the template alignment procedure requires an 
efficient implementation. The natural multiresolution representation of subdivi- 
sion surfaces serves as a suitable framework for implementing a coarse-to-fine 
matching algorithm that improves the convergence properties of our method. 
The multiscale approach also overcomes the deficiencies of the single resolution 
template matching for discontinuous textures. 



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3021, pp. 133-145, 2004.

Related work. Our work builds on the strengths of the inverse compositional
image alignment method developed by Baker and Matthews [4]. We extend their
framework to handle templates specified on subdivision surfaces and propose a
principled way of designing a set of smooth parametric adjustments (Section 3.3).
Subdivision surfaces can represent large smooth patches of an
arbitrary surface and control them with few control points. In this setting it is
natural to consider a coarse-to-fine version of the matching algorithm (Section 6).
This improves the convergence properties of the matching without resorting to
eigen-tracking approaches. Thus, it places fewer restrictions on the space of
allowable surface deformations and does not require a priori knowledge of the
surface deformation model. The multiscale approach has been applied successfully
to many related problems such as optical flow [5][6] and template search [7].

The goal of our work is similar to the Active Blobs effort [8]. While our 
method can handle general textured surfaces, it achieves its best tracking per- 
formance on specially quad-textured surfaces due to better control over con- 
ditioning of the involved matrix computations throughout all the levels of the 
hierarchy. In this paper, we only implement a simple appearance variation model 
and do not handle the detection of occluded regions and outlier pixels. Rather, 
the focus is on developing an efficient inverse compositional alignment procedure 
for general smooth surfaces described as subdivision models. 

The paper is organized as follows: Section 2 introduces compositional tem- 
plate matching, Section 3 talks about surface and warp representations used in 
our work, Section 4 contains the detailed description of the proposed method. 
Sections 5 and 6 cover the partial template matching and the multiscale ap- 
proach. Section 7 discusses the obtained results. 

2 Compositional Template Matching 

2.1 Forward Compositional Methods for Surfaces 

The goal of template matching is to find the best warp of a template to match
a given image. In the scenario of surface tracking, the template is better treated
as a function on the parametric space of the tracked surface. Formally, let $\Xi$ be
the parametric space of the surface. Denote the projection of the surface in the
image as $S(\xi)$, so that $S : \Xi \to \mathbb{R}^2$; we shall call the function $S$ a surface map.
The template function $T$ represents surface color, so that $T : \Xi \to C$, where
$C = \mathbb{R}$ for grayscale images and $C = \mathbb{R}^3$ for color images.

Given an image $I : \mathbb{R}^2 \to C$, the matching problem consists of finding the
surface map $S$ which minimizes the error functional

$$E(S) := \|I \circ S - T\|^2 = \int_{\Xi} |I(S(\xi)) - T(\xi)|^2 \, d\xi.$$

At this point we assume that all of the surface is visible; the case of partial
visibility is considered in Section 5. We would also like to delay specifying a
particular representation for the surface map $S$ and treat it as a general smooth
function in this section.

The two approaches to the template matching problem are the additive and
compositional methods. The additive approach performs the update $S \leftarrow S + dS$, and
finds the optimal surface adjustment $dS$ by solving $\min_{dS} \|I \circ (S + dS) - T\|^2$.
The compositional approach updates the surface map via $S \leftarrow S \circ W$. It looks
for the optimal warping in the parametric space: $\min_W \|I \circ S \circ W - T\|^2$, or in
more detail:

$$\min_W \int_{\Xi} |I(S(W(\xi))) - T(\xi)|^2 \, d\xi. \qquad (1)$$

The two approaches can be shown to be equivalent when the incremental
warp $W : \Xi \to \Xi$ is close to the identity map, so that $W(\xi) \approx \xi + dW(\xi)$ and
$dW$ is small. Then the corresponding surface adjustment will be close to

$$S \circ W - S \approx \frac{\partial S}{\partial \xi} dW. \qquad (2)$$

Details of the proof can be found in [4]. When the Jacobian $\partial S / \partial \xi$ is not full
rank the two approaches are not equivalent. This happens, for instance, in the
silhouette region of the surface map where the 2D tangential space gets projected
onto a 1D line in the image.
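
The first-order relation in Eqn (2) can be checked numerically. The sketch below uses a hypothetical smooth surface map (`surface_map`, chosen only for illustration and unrelated to subdivision surfaces) and compares the exact change $S(W(\xi)) - S(\xi)$ for a small warp increment with the linear prediction $(\partial S / \partial \xi)\, dW$.

```python
def surface_map(xi):
    """Hypothetical smooth surface map S: R^2 -> R^2 (illustration only)."""
    u, v = xi
    return (u + 0.5 * v * v, v + 0.3 * u * v)

def surface_jacobian(xi):
    """Analytic Jacobian dS/dxi of surface_map."""
    u, v = xi
    return ((1.0, v), (0.3 * v, 1.0 + 0.3 * u))

def compositional_adjustment(xi, d_w):
    """Exact change S(W(xi)) - S(xi) for the warp W(xi) = xi + d_w."""
    warped = surface_map((xi[0] + d_w[0], xi[1] + d_w[1]))
    base = surface_map(xi)
    return (warped[0] - base[0], warped[1] - base[1])

def linearized_adjustment(xi, d_w):
    """First-order prediction (dS/dxi) d_w from Eqn (2)."""
    jac = surface_jacobian(xi)
    return (jac[0][0] * d_w[0] + jac[0][1] * d_w[1],
            jac[1][0] * d_w[0] + jac[1][1] * d_w[1])
```

The two results differ only by terms of second order in the increment, which is why the additive and compositional updates agree for small warps.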

The compositional approach is possible in its pure form when one can find a 
set of planar warps which form a group. When representing a general evolving 
surface, one requires more flexibility than present in the classical groups of trans- 
formations such as translations, affine transforms or homographies. On the other 
hand, the very general group formed by composition of arbitrary smooth maps 
used above is not practical. Our approach will be therefore to derive composi- 
tional methods for the general smooth case, and then approximate the needed 
computations in a smooth basis. 

2.2 Inverse Compositional Method 

We shall now derive the inverse compositional method [4] in the general case of
smooth surface maps and warps. The basic assumption for equivalence between
the forward and inverse compositional methods is the closeness of the
incremental warp map to the identity map. In particular, it is assumed that
$\det(\partial W / \partial \xi) \approx 1$. Then, the change of variable $\xi = W^{-1}(\eta)$ in (1) leads to the
following minimization problem:

$$\min_W \int_{\Xi} |I(S(\eta)) - T(W^{-1}(\eta))|^2 \, d\eta.$$

The inverse of the incremental warp $V := W^{-1}$ can be sought directly:

$$\min_V \int_{\Xi} |I(S(\eta)) - T(V(\eta))|^2 \, d\eta.$$

This approach results in less per-frame computation than the corresponding
forward methods, as was shown in [4]. Once the incremental warp is found, we
can update the surface map via $S \leftarrow S \circ V^{-1}$.



3 Template and Warp Representation

The preceding section introduced the inverse compositional method in a general 
form. In practice, we need to use a specific representation for both the subdivision 
surface S and the parametric warp W. This section describes surface and warp 
representations used in our approach. 

3.1 Subdivision Surface Maps 

We define our template surface maps using the Doo-Sabin subdivision scheme [9].
In this section, we give a short description of our implementation. For a detailed
introduction to subdivision surface modeling the reader is referred to [10] [3].

A subdivision surface is controlled by a polygonal control mesh $\mathcal{M}$ that has
a set of control vertices $V$. For each control vertex $k \in V$, a planar position
$p_k \in \mathbb{R}^2$ is specified. The Doo-Sabin subdivision scheme is a dual scheme, so
that its (dual) control vertices correspond to faces of a primal polygonal mesh
$\mathcal{M}'$. We restrict the primal mesh to be a manifold quad mesh, possibly with
boundary.

The parametric space $\Xi$ of a subdivision surface map is formed as the union
of square patches $\Xi_k$ ($k \in V$) glued along the shared edges. Each patch is a
square $[0, 1] \times [0, 1]$, and is associated with a particular primal face of the primal mesh or the
corresponding dual control vertex from $V$. A point $\xi$ in the parametric space is then
fully described by the (dual) control vertex index $k$ and a position in $[0, 1] \times [0, 1]$.
Given a function $f : \Xi \to \mathbb{R}^d$, we shall use the notation $\partial f / \partial \xi^i$, $i = 1, 2$ for its
derivatives. This is well defined within patches away from the patch boundaries,
which is sufficient for the purposes of this paper.

We use the primal-dual approach described in [11] to implement subdivision. The
corner vertices depend on the boundary and inside vertices that share the same
control face of the mesh; we also exclude the corner patches from the parametric
region of the template. Thus, the corner control vertices only appear as an auxiliary
dependent quantity; to simplify notation we redefine the set of control vertices $V$ as
the union of inside and boundary vertices in the remainder of this paper, with the
parametric region $\Xi$ defined correspondingly to exclude the corner patches.

Having defined the parametric region $\Xi$ and the set of control vertices $V$, we
define the $C^1$ subdivision surface map at any parametric position $\xi$ via

$$S[p](\xi) := p_k \phi^k(\xi),$$

where the summation in $k$ is assumed over every index in $V$, and the $p_k$ are
two-dimensional points.

The following properties of $\phi^k(\xi)$ are important:

• $\sum_{k \in V} \phi^k(\xi) = 1$ for $\xi \in \Xi$, so that the $\phi^k$, $k \in V$ form a partition of unity.

• Each $\phi^k(\xi)$ has local support in $\Xi$ that consists of the patch of $k$ together
with all the patches that share at least one patch corner with it. Thus,
each parametric point $\xi$ is only affected by the few control points whose support
includes $\xi$.

3.2 Control Mesh Extension 

In this section we describe a procedure for extrapolating the control mesh of 
the subdivision map from a given subset of (dual) control vertices with known 
positions to a wider set of control vertices adjacent to this known region. We 
shall use this procedure in two situations: creation of a canonical control position 
arrangement for the atomic warps as described in Section 3.3, and extension of 
active region positions for partial template matching as described in Section 5. 

Let $V_a$ be the set of active control vertices, and define the set $V_b$ of non-active
control vertices immediately adjacent to the set $V_a$. Suppose that the positions of
control vertices in both sets $V_a$ and $V_b$ are known. We introduce three sets
of control vertices, depending on their relative adjacency to the known vertex set,
and define a procedure for extending positions in $V_a \cup V_b$ to these three sets.
The figure on the right shows an example of vertex set assignment.

• $V_c$ is the set of control vertices (not in $V_a$ or $V_b$) whose primal faces share at
least one primal control vertex with the primal control faces corresponding to
control vertices in $V_a$. In other words, a control vertex from $V_c$ will belong to
at least one dual control face whose vertices include at least one vertex from
$V_a$. Note that the same dual face will also include at least two vertices from
$V_b$ (this will be important for the extrapolation method described below).

• $V_d := \mathrm{bou}(V_a \cup V_b) \setminus (V_a \cup V_b \cup V_c)$.¹ Each vertex in $V_d$ has at least one
adjacent vertex $v_b \in V_b$. By construction, the vertex $v_0$ on the opposite side
of $v_b$ will also be from $V_a \cup V_b$.

• $V_e$ is the set of control vertices (not in $V_a \cup V_b \cup V_c \cup V_d$) whose primal
faces share at least one primal control vertex with the primal control faces
corresponding to control vertices in $V_b$.

For a vertex in $V_c$, take a dual face that has at least three known vertices.
Let this dual control face have $n$ vertices, and index its corners with integers
$i = 0, \ldots, n - 1$ in counterclockwise order. We associate vertex $i$ with the
parameter value $\alpha_i = 2\pi i / n$, and find the ellipse $p(\alpha) = C + D_1 \cos\alpha + D_2 \sin\alpha$
that best fits the known points in the least-squares sense. We then assign all the
unknown vertex positions on the ellipse at the appropriate $\alpha$ locations. In the
regular case of three known vertices and a single unknown vertex, we obtain a
parallelogram rule. If there are several predictions for a vertex in $V_c$, we compute
their average.

¹ For a subset of vertices $U \subset V$ we define its boundary $\mathrm{bou}(U)$ as the set of vertices
from $V \setminus U$ adjacent to at least one vertex in $U$.

Now for each vertex in $V_d$ we extrapolate its value linearly via $p(v_d) =
2p(v_b) - p(v_0)$, where $v_b$ and $v_0$ are as described above. After this step, all the
vertices in $V_a \cup V_b \cup V_c \cup V_d$ are assigned a position. We can now replicate
the extrapolation step of $V_c$ for all the vertices in $V_e$, which concludes our
extrapolation procedure.

The described procedure guarantees the position assignment for all the
control vertices that affect the subdivision map values within parametric patches
associated with $V_a \cup V_b$. This will be especially important for the partial
template matching of Section 5.
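
The ellipse-fitting step used for the vertices in $V_c$ can be sketched as follows. The function name `extend_dual_face` and the convention of marking unknown corners with `None` are assumptions made for this example; the averaging of several predictions and the linear rule for $V_d$ are not included.

```python
import numpy as np

def extend_dual_face(known):
    """Fill in unknown corners of an n-sided dual face by an ellipse fit.

    known: list of length n holding a 2D point for each known corner and
    None for each unknown corner (at least three corners must be known).
    Returns an (n, 2) array; unknown corners are placed on the fitted ellipse.
    """
    n = len(known)
    alpha = 2.0 * np.pi * np.arange(n) / n
    # Each row is [1, cos(alpha_i), sin(alpha_i)], multiplying [C; D1; D2].
    design = np.column_stack([np.ones(n), np.cos(alpha), np.sin(alpha)])
    idx = [i for i, p in enumerate(known) if p is not None]
    if len(idx) < 3:
        raise ValueError("need at least three known corners")
    targets = np.array([known[i] for i in idx], dtype=float)
    # Least-squares solve for the ellipse coefficients C, D1, D2.
    coeffs, *_ = np.linalg.lstsq(design[idx], targets, rcond=None)
    result = design @ coeffs
    result[idx] = targets  # keep the known positions unchanged
    return result
```

For a quad face with three known corners the fit is exact, and the fourth corner comes out as the sum of its two neighbors minus the opposite corner, which is the parallelogram rule mentioned above.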



3.3 Warp Space 

For a surface map represented as $S[p](\xi) = \phi^k(\xi) p_k$, the control point
adjustments $\Delta p$ change the surface sample positions by $\phi^k(\xi) \Delta p_k$. For the
compositional approach to work, this surface adjustment has to match the
surface adjustment obtained via warping. Any smooth parameterized warp map
$W(\xi; q)$ such that $W(\xi; 0) = \xi$ can be written:

$$W(\xi; q) = \xi + \frac{\partial W(\xi; 0)}{\partial q} q + O(q^2).$$

Define $\gamma_k^{ij}(\xi) := \partial W^i(\xi; 0) / \partial q_k^j$. For the purpose of template matching, these
first derivatives are all that is necessary to define explicitly.

For a fixed index $K$, we choose $\gamma_K^{ij}$ in such a way that for some configuration
of control points $p^K$ the surface map update corresponding to the parametric
control point $q_K$ matches the surface update coming from the control point
adjustment $\Delta p_K$. More precisely, we would like that for all $\xi \in \Xi$:

$$\phi^k(\xi) \Delta p_k^i = \frac{\partial S^i[p^K](\xi)}{\partial \xi^j} \gamma_k^{jl}(\xi) q_k^l$$

when $\Delta p_k^i = \delta_k^K Z^i$ and $q_k^i = \delta_k^K Z^i$ (here $Z$ is an arbitrary two-dimensional
vector). It follows that (no summation on the capital index $K$):

$$\phi^K(\xi) \delta^{il} = \frac{\partial S^i[p^K](\xi)}{\partial \xi^j} \gamma_K^{jl}(\xi).$$

The quantity on the left hand side is a scaled identity matrix, therefore we can
conclude that $\gamma_K^{ij}(\xi)$ is the inverse of the $2 \times 2$ matrix $\partial S[p^K](\xi) / \partial \xi$ times the
scalar value $\phi^K(\xi)$ at each $\xi$. Hence we set

$$\gamma_K(\xi) = \left( \frac{\partial S[p^K](\xi)}{\partial \xi} \right)^{-1} \phi^K(\xi). \qquad (3)$$

Each $\gamma_K^{ij}(\xi)$ defined above is a local function, non-zero on the support of the
basis function $\phi^K(\xi)$.

We now need to define $p^K$ (for this canonical arrangement of control points
around a vertex $K$ the warped surface adjustment will have an exact
representation in the basis of subdivision shapes). Note that only the evaluation of $S[p^K]$
within the support of a single basis function $\phi^K(\xi)$ is required. We use the control
mesh extension process from Section 3.2 with the active set $V_a$ consisting of a
single vertex $K \in V$. The four immediate neighbors of $K$ form the set $V_b$, and
we assign the five control point values on the plane so that $p_K^K$ is at the
origin, and the points from $V_b$ are positioned at $(1, 0)$, $(0, 1)$, $(-1, 0)$, and $(0, -1)$.
The extension procedure then defines all the other positions required to evaluate
$S[p^K]$ on the patches near $K$. In our experience, this results in non-degenerate
assignments of $p^K$.
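
At a single parametric point, Eqn (3) amounts to scaling the inverse of a $2 \times 2$ Jacobian. A minimal sketch, assuming the Jacobian of $S[p^K]$ and the basis function value at that point are already available (the function name `atomic_warp_derivative` is hypothetical):

```python
import numpy as np

def atomic_warp_derivative(jacobian, phi_value):
    """Return gamma_K(xi) = (dS[p^K]/dxi)^{-1} * phi^K(xi) as in Eqn (3)."""
    jacobian = np.asarray(jacobian, dtype=float)
    # Solving J X = phi * I avoids forming the inverse explicitly.
    return np.linalg.solve(jacobian, phi_value * np.eye(2))
```

Multiplying the result by the Jacobian gives back the scaled identity matrix on the left hand side of the matching condition. The solve fails if the Jacobian is singular, which corresponds to a degenerate assignment of $p^K$.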



4 Inverse Compositional Method for Subdivision Maps 



4.1 Parametric Adjustment 

We can now find the optimal parameters $q$ of the parametric adjustment by
minimizing the fit functional with respect to $q$:

$$J(q) := \int_{\Xi} |I(S[p](\xi)) - T(W(\xi; q))|^2 \, d\xi.$$

We introduce the pointwise error of the current fit as $E(\xi; p) := I(S[p](\xi)) - T(\xi)$,
and obtain the following approximation to $J(q)$:

$$J(q) \approx \int_{\Xi} \sum_{m} \left| E_m(\xi; p) - \frac{\partial T_m}{\partial \xi^i}(\xi) \gamma_k^{ij}(\xi) q_k^j \right|^2 d\xi,$$

where the subscript $m = 1, 2, 3$ denotes the appropriate color channel.

Differentiating this expression with respect to $q_k^j$ and introducing the
notation $h_m^{kj}(\xi) := \gamma_k^{ij}(\xi) \, \partial T_m / \partial \xi^i (\xi)$ we get the following system of linear equations
for the optimal parametric adjustment parameters $q$:

$$\left[ \int_{\Xi} h_m^{kj}(\xi) h_m^{k'j'}(\xi) \, d\xi \right] q_{k'}^{j'} = \int_{\Xi} h_m^{kj}(\xi) E_m(\xi; p) \, d\xi, \quad k \in V, \; j = 1, 2.$$

As is expected from an inverse compositional method, the matrix
$A_{kj,k'j'} = \int_{\Xi} h_m^{kj}(\xi) h_m^{k'j'}(\xi) \, d\xi$ on the left hand side does not depend on $p$ and its inverse can
be precomputed, while the right hand side $b_{kj} = \int_{\Xi} h_m^{kj}(\xi) E_m(\xi; p) \, d\xi$ depends
on the current error of the fit, and has to be recomputed during optimization.

The linear system $Aq = b$ has $2|V|$ unknowns, and in order to guarantee its
proper solution we need to ensure that the surface template has enough edge
features of different orientations (similar to [12]). In order to ensure more robust
tracking we apply the multiscale template: in this case we need to ensure that the
edge features exist at all the resolutions. Surfaces marked with quad patterns do
provide such features as long as the surface patches controlled by a single vertex
cover a few of the pattern quads (see Figure 1 for the comparison of condition
numbers of the matrix $A = H^T H$).
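
The split between precomputed and per-iteration work described above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the coefficients $h$ are assumed to be already sampled into a dense matrix `H` (rows index sample and color channel pairs, columns index the $2|V|$ unknowns), and the function names and array sizes are hypothetical, not taken from the paper.

```python
import numpy as np

def precompute_solver(H):
    """Invert A = H^T H once; A depends on the template only, not on p."""
    A = H.T @ H
    return np.linalg.inv(A)

def solve_adjustment(A_inv, H, E):
    """Per-iteration step: form b = H^T E from the current fit error and solve A q = b."""
    b = H.T @ E
    return A_inv @ b

# Synthetic check with made-up sizes: 200 samples, 2|V| = 8 unknowns.
rng = np.random.default_rng(0)
H = rng.standard_normal((200, 8))
q_true = rng.standard_normal(8)
E = H @ q_true                    # an error field explained exactly by q_true
A_inv = precompute_solver(H)      # done once at initialization
q = solve_adjustment(A_inv, H, E) # repeated at every iteration
cond = np.linalg.cond(H.T @ H)    # the quantity compared in Figure 1
```

A large value of `cond` would signal that the template lacks edge features in some direction, which is the situation the paragraph above warns about.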

4.2 Evaluation of Control Vertex Adjustments

Once the parametric adjustment is found, we compute the surface sample
displacements that correspond to the approximate inverse of the parametric
displacement, namely a surface sample $S[p](\xi)$ is moved to $S[p](W(\xi; -q))$ where $q$
is the vector of optimal parametric displacement parameters found in the previous section.
We find an approximation to the actual sample movement via

$$S[p](W(\xi; -q)) \approx S[p](\xi) - \frac{\partial S[p]}{\partial \xi^j}(\xi)\,\gamma^{j}_{k,l}(\xi)\,q_k^l,$$

so that the sample with parameter $\xi$ undergoes the displacement

$$\sigma(\xi; q) := -\frac{\partial S[p]}{\partial \xi^j}(\xi)\,\gamma^{j}_{k,l}(\xi)\,q_k^l;$$

note that $\partial S[p]/\partial \xi$ depends on the current $p$, which makes the update step
nonlinear [1].

In order to find the approximation to the appropriate displacement of control
vertices of the subdivision surface we sample the surface displacement on a fixed
four-by-four grid of parameter samples within each quad patch, and solve for the
corresponding control vertex displacements that are optimal in the least-squares
sense.

Denote the discrete set of all the sampled surface displacement parameters as
$\Xi_S$. We need to find $\Delta p$ such that the following set of constraints is approximately
satisfied: $S[\Delta p](\xi) = \sigma(\xi; q)$ for all $\xi \in \Xi_S$. Using the expression for the
subdivision map we obtain $\Delta p_k \phi^k(\xi) = \sigma(\xi; q)$, $\xi \in \Xi_S$. Introduce the matrix of
basis function sample values $(\Phi_S)_{\xi v} := \phi^v(\xi)$, $v \in V$, $\xi \in \Xi_S$. The least-squares
solution of the above linear system then gives us the following expression for $\Delta p$:

$$\Delta p = (\Phi_S^T \Phi_S)^{-1} \Phi_S^T \sigma(q),$$

and the matrix inverse on the right-hand side can be precomputed during system
initialization. Once the optimal control point adjustment is found, we update
the control point positions using $p^{(n+1)} = p^{(n)} + \Delta p$, which concludes a single
iteration of our template alignment procedure.
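
As a rough sketch of the least-squares step above (Python with NumPy), the matrix $(\Phi_S^T \Phi_S)^{-1} \Phi_S^T$ can be formed once and then applied to each new vector of sampled displacements. The sizes, the random stand-in for the basis samples, and the function names are assumptions made for illustration only.

```python
import numpy as np

def precompute_pseudoinverse(Phi):
    """(Phi^T Phi)^{-1} Phi^T; fixed once the parameter sample grid is chosen."""
    return np.linalg.inv(Phi.T @ Phi) @ Phi.T

def control_point_update(Phi_pinv, sigma):
    """Least-squares control vertex displacements for sampled displacements sigma."""
    return Phi_pinv @ sigma

# Synthetic check: 64 parameter samples, 6 control vertices, 2-D displacements.
rng = np.random.default_rng(1)
Phi = rng.random((64, 6))              # stand-in for basis function sample values
dp_true = rng.standard_normal((6, 2))
sigma = Phi @ dp_true                  # displacements that dp_true reproduces exactly
Phi_pinv = precompute_pseudoinverse(Phi)
dp = control_point_update(Phi_pinv, sigma)

p_current = np.zeros((6, 2))           # hypothetical current control points
p_next = p_current + dp                # one iteration of the update
```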

Appearance variation. A simple constant appearance variation model can be
added as in [4] by changing the pointwise fit error to be
$E(\xi; p) := (I(S[p](\xi)) - I_{average}) - T^*(\xi)$, where $I_{average}$ is an estimate of the average color value of
image samples of the current surface, and $T^*(\xi) := T(\xi) - T_{average}$ is the original
template adjusted so that its average is zero.
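
A minimal sketch of this zero-mean error in Python with NumPy follows. It assumes the image and template samples are stored as arrays of shape (samples, channels) and that the averages are taken per color channel; both choices, and the function name, are assumptions rather than details given in the text.

```python
import numpy as np

def zero_mean_error(image_samples, template_samples):
    """Pointwise error with the average color removed from both image and template."""
    I_average = image_samples.mean(axis=0)
    T_star = template_samples - template_samples.mean(axis=0)
    return (image_samples - I_average) - T_star

# A constant brightness offset should leave the error unchanged (here: zero).
rng = np.random.default_rng(2)
T = rng.random((100, 3))
I = T + 0.3
err = zero_mean_error(I, T)
```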

5 Partial Template Matching 

When some part of the surface is occluded, it is no longer possible to track its
motion, and the template matching should exclude the corresponding control
vertices. At the same time, it would not be practical to create a completely
separate surface template from scratch. Rather, we would like to reuse a partially
active template. In this section, we describe how the template matching
algorithm described above can be modified when only a subset of control vertices
is allowed to change independently.

We assume that a subset $V_A$ of active vertices is given (one can think that
the surface portion $\mathcal{S}_A = \bigcup_{k \in V_A} \mathcal{S}_k$ is fully visible), and take $V_B = \mathrm{bou}(V_A)$. The
union of $V_A$ and $V_B$ is therefore used as the set of independent control vertices
found by template matching; we call it $V_{AB} := V_A \cup V_B$. Thus, the surface is
determined by vertices in $V_{AB}$; we modify the set of warp parameters $\Delta q$ in
the same way, and all the linear systems solved in the algorithm will be of such
reduced dimensions.

Another modification to the algorithm is the modification of the integration
region $\mathcal{S}$. Denote by $\mathcal{E}_{AB}$ the set of all the edges with both ends in $V_{AB}$.
With each edge $(a, b)$ we associate the union $\mathcal{S}_{(a,b)}$ of two triangular sections
of the corresponding patches $\mathcal{S}_a$ and $\mathcal{S}_b$, as shown in the
figure to the right. The integration region $\mathcal{S}_{AB}$ used in the partial matching
algorithm is defined to be $\mathcal{S}_{AB} = \bigcup_{(a,b) \in \mathcal{E}_{AB}} \mathcal{S}_{(a,b)}$. Therefore, the surface map
needs to be evaluated on samples within $\mathcal{S}_{AB}$ at every step of matching. This is
only possible when a wider set of control vertex positions is known; the extension
procedure from Section 3.2 is used for extrapolating vertex positions from the
set $V_{AB}$ to the wider set $V_{ext}$ required for the surface evaluation on $\mathcal{S}_{AB}$.

6 Tracking Quad Patterns with the Multiscale Matching 

The derivation of the template alignment method of the previous section relied 
on the fact that we could take derivatives of the template function and apply 
basic first order approximations. We would like to apply this method to tracking 
colored quad patterns similar to the ones used in [13]. Those patterns considered 
as functions are not even continuous. In this section, we discuss the issues that 
arise from this complication and our approach to overcoming them. We start by 
analyzing a simple one-dimensional example, and then discuss the implications 
of this analysis for the original surface tracking case. 

One-dimensional example. Assume that our template is the step function$^2$:
$T(x) = \chi(x)$, and consider the error functional $J(p) := \int |T(x + p) - I(x)|^2 dx$.
If the image function is a shifted step function, that is $I(x) = \chi(x + a)$ for some
$a$, the error functional is not smooth: $J(p) = |p - a|$. It follows that
gradient-based methods may not perform well on such an optimization problem. This can
be noticed when we apply template matching on sharp images with
discontinuities: the adjustment to the template parameters $p$ coming from the gradient
descent method will stop decreasing as it approaches the optimal value.
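
The non-smooth form $J(p) = |p - a|$ can be checked numerically. The sketch below (Python with NumPy) integrates the squared difference of two shifted step functions with a midpoint rule; the integration window, the sample count, and the value of `a` are arbitrary choices made for illustration.

```python
import numpy as np

def chi(x):
    """Step function: 0 for x < 0 and 1 for x > 0."""
    return (x > 0).astype(float)

def J(p, a, lo=-5.0, hi=5.0, n=200000):
    """Midpoint-rule approximation of the integral of |chi(x + p) - chi(x + a)|^2."""
    dx = (hi - lo) / n
    x = lo + (np.arange(n) + 0.5) * dx
    return float(np.sum((chi(x + p) - chi(x + a)) ** 2) * dx)

a = 0.8
shifts = (-1.0, 0.0, 0.5, 0.8, 1.5)
values = [J(p, a) for p in shifts]   # each value should be close to |p - a|
```

The kink at $p = a$ is what makes a plain gradient step behave poorly near the optimum.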

$^2$ Define $\chi(x) = 0$ for $x < 0$ and $\chi(x) = 1$ for $x > 0$.

In practice we work with a discretized version of the template, so all the
derivatives can be evaluated as divided differences. Consider the template at the
grid step $h$:

$$T_h(x) = \begin{cases} 0, & x < -h/2 \\ 1/2 + x/h, & -h/2 \le x \le h/2 \\ 1, & x > h/2 \end{cases}$$

The minimization of the functional $J_h(p) := \int |T_h(x + p) - I(x)|^2 dx$ leads to
the following expression for the optimal translation of the template:

$$p_h^* = \int (I - T_h)\,T_h'\,dx \Big/ \int (T_h')^2\,dx = \int_{-h/2}^{h/2} (I - T_h)\,dx.$$

Hence the upper bound on the parameter adjustment is proportional to the step
$h$: $|p_h^*| \le h\,\|I - T_h\|_\infty \le 2h \max\{\|I\|_\infty, 1\}$. Thus, for large adjustments in $p$ we
need to employ coarse versions of the template, while small $h$ is preferable for
the precise positioning of the template. It therefore makes sense to proceed from
coarse to fine discretizations. This is similar in spirit to multiscale optical flow
and template matching algorithms [6] [7] [14].
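
The bound above, and the benefit of going from coarse to fine, can be illustrated on the shifted step image $I(x) = \chi(x + a)$. The following Python sketch evaluates the update $\int_{-h/2}^{h/2} (I - T_h)\,dx$ with a midpoint rule after shifting the template by the current estimate $p$. The schedule of grid steps, the three iterations per level, and all names are illustrative assumptions, not the paper's tracking code.

```python
import numpy as np

def chi(x):
    return (x > 0).astype(float)

def ramp(x, h):
    """Discretized template T_h: a linear ramp of width h."""
    return np.clip(0.5 + x / h, 0.0, 1.0)

def optimal_shift(a, p, h, n=100000):
    """Update for the template T_h(x + p) against I(x) = chi(x + a).

    With y = x + p the integral runs over the ramp support [-h/2, h/2].
    """
    dy = h / n
    y = -h / 2 + (np.arange(n) + 0.5) * dy
    return float(np.sum(chi(y + a - p) - ramp(y, h)) * dy)

def track(a, steps, iters=3):
    """Apply `iters` updates at each grid step, in the order given."""
    p = 0.0
    for h in steps:
        for _ in range(iters):
            p += optimal_shift(a, p, h)
    return p

a = 0.7
single_step = optimal_shift(a, 0.0, 1.0)                 # limited by h = 1
p_coarse_to_fine = track(a, steps=(4.0, 2.0, 1.0, 0.5, 0.25))
p_fine_only = track(a, steps=(0.25,))                    # same 3 iterations, fine level only
```

In this toy setting each update can move the template by at most $h/2$, so the fine-only run is still far from $a$ after three iterations while the coarse-to-fine run has converged.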

We apply the coarse-to-fine approach to our surface template matching
algorithm. To illustrate its convergence properties at different resolutions we plot
the mean-square error of a template fit with respect to the number of iterations
for the Quad sheet model (see Figure 1). It is clear that the coarse template
matching makes larger adjustments towards the minimum but the precision of
the result is limited. At the same time a finer template matching is able to
recover the minimum with high precision but requires more iterations, and is
also more computationally intensive. The combined method shown in the plot
uses three iterations on each level starting from the coarsest, and achieves better
convergence at lower cost. The moderate condition number of the matrix $H^T H$
plotted in Figure 1 also contributes to the success of the multiscale method for
quad-marked patterns.

Fig. 1. Left: mean-square error during iterations of template matching procedures at
different levels of template resolution. Right: comparison of condition numbers of
the matrices $H^T H$ at different levels of resolution. For quad patterns the condition number stays
relatively low on all the levels, while for natural patterns it is less controlled.

Fig. 2. Various surface templates used. See accompanying video for tracking examples.

In the case of a piecewise constant quad-marked pattern, the only template
samples participating in template matching are located along the discontinuities.
Thus, when the template subdivision level $l$ is increased, the number of sampled
points will only increase linearly with $n$ rather than quadratically (as $n^2$, where
$n = 2^l$) as is the case for a template with globally non-trivial gradient. This
compression of the template improves the efficiency of the matching (see the left
two templates in Figure 2; the yellow dots indicate the used template samples).
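
The linear versus quadratic growth can be illustrated with a toy checkerboard. The Python sketch below counts non-zero finite differences as a proxy for the participating samples; the checkerboard with four quads per side and the counting rule are assumptions made for this illustration and are not the paper's sampling scheme.

```python
import numpy as np

def count_edge_samples(level, quads=4):
    """Return (non-zero finite differences, total samples) for an n x n checkerboard, n = 2**level."""
    n = 2 ** level
    idx = np.arange(n) * quads // n                 # quad index of each sample row or column
    pattern = (idx[:, None] + idx[None, :]) % 2     # piecewise constant 0/1 pattern
    dx = np.count_nonzero(np.diff(pattern, axis=0))
    dy = np.count_nonzero(np.diff(pattern, axis=1))
    return dx + dy, n * n

counts = {level: count_edge_samples(level) for level in (4, 5, 6)}
```

Each extra level doubles the first number and quadruples the second.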



7 Results 



We have implemented our template matching algorithm for Doo-Sabin
subdivision surfaces within a real-time tracking application framework. After manual
initialization, the template is tracked in the video sequence. A linear temporal
prediction scheme is employed to obtain the initial guess for the template positioning
in every consecutive frame of video. The application was able to perform at 30
frames per second for all the full surface tracking examples presented below on a
2GHz Pentium laptop. The video was acquired with a digital camera at 640x480
resolution. We used the combined multiscale method that ran two iterations of
matching on each level of resolution on levels two to four, and a single iteration
on level five. The table below shows the number of inside and boundary control
vertices in the models used for this paper. The accompanying video contains
video sequences captured in real time.



Name            Number of CVs   Number of active   Number of active   Texture type
                                inside CVs         boundary CVs
Quad sheet      12              4                  8                  pure
Ball            9               3                  6                  pure
Shirt 2x1       8               2                  6                  acquired
Elbow 3x1       11              3                  8                  acquired
Rock            5               1                  4                  acquired
Partial shirt   21              5                  10                 pure

We have used both predefined pure quad template patterns and the textures
acquired from the video frame during the initialization process. The pure
patterns result in a sparse set of participating samples along the quad
edges; for the acquired textures we used thresholding on the $h_m^{k,l}(\xi)$ coefficients to
determine which samples should participate in the integral discretization.

For the partial tracking example, we used a five by five grid template for 
the t-shirt. The active region included five vertices as shown in Figure 2. This 
example runs at 15 frames per second. 

8 Future Work 

We presented a multiscale method for matching subdivision surface templates.
Future work will need to address the maintenance of the active visible region
for partial surface tracking as well as the automatic initialization procedure. The
compositional methods work within the surface and cannot account for surface
displacement near its silhouettes from a single view. A multi-view extension
of the presented procedure can help alleviate this problem. More complex
appearance variation modeling is also left for future work.



Acknowledgments. This work was partially supported by NSF(CCR-0133554) 
and University of Michigan AI Lab. 



References 

1. Matthews, I., Baker, S.: Active appearance models revisited. Technical Report 
CMU-RI-TR-03-02, Robotics Institute, Carnegie Mellon University, Pittsburgh, 
PA (2003) 

2. Cascia, M.L., Sclaroff, S., Athitsos, V.: Fast, reliable head tracking under vary- 
ing illumination: An approach based on robust registration of texture-mapped 3d 
models. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI) 22 (2000) 

3. Zorin, D., Schroder, P., eds.: Subdivision for Modeling and Animation. Course 
Notes. ACM SIGGRAPH (1999) 

4. Baker, S., Matthews, I.: Equivalence and efficiency of image alignment algorithms. 
In: Proc. of the CVPR. (2001) 

5. Szeliski, R., Shum, H.Y.: Motion estimation with quadtree splines. IEEE Trans- 
actions on Pattern Analysis and Machine Intelligence 18 (1996) 1199-1210 

6. Simoncelli, E.: Bayesian multi-scale differential optical flow. In: Handbook of 
Computer Vision and Applications. (1993) 128-129 

7. Borgefors, G.: Hierarchical chamfer matching: A parametric edge matching algo- 
rithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 10 (1988) 
849-865 

8. Sclaroff, S., Isidoro, J.: Active blobs. In: Proceedings of ICCV 98. (1998) 1146-1153 

9. Doo, D., Sabin, M.: Behaviour of recursive division surfaces near extraordinary 
points. Computer-Aided Design 10 (1978) 356-360 

10. Warren, J., Weimer, H.: Subdivision Methods For Geometric Design: A Construc- 
tive Approach. Morgan Kaufmann (2001) 






11. Zorin, D., Schroder, P.: A unified framework for primal/dual quadrilateral subdi- 
vision schemes. CAGD 18 (2001) 429-454 

12. Shi, J., Tomasi, C.: Good features to track. In: CVPR. (1994) 593-600 

13. Guskov, I., Klibanov, S., Bryant, B.: Trackable surfaces. In: Proceedings of
ACM/EG Symposium on Computer Animation. (2003) 251-257

14. Gleicher, M.: Projective registration with difference decomposition. In: Proc. of 
IEEE CVPR 1997. (1997) 331-337 



A Fourier Theory for Cast Shadows 



Ravi Ramamoorthi$^1$, Melissa Koudelka$^2$, and Peter Belhumeur$^1$

$^1$ Columbia University, {ravir,belhumeur}@cs.columbia.edu
$^2$ Yale University, melissa.koudelka@yale.edu



Abstract. Cast shadows can be significant in many computer vision applications 
such as lighting-insensitive recognition and surface reconstruction. However, most 
algorithms neglect them, primarily because they involve non-local interactions in 
non-convex regions, making formal analysis difficult. While general cast shad- 
owing situations can be arbitrarily complex, many real instances map closely to 
canonical configurations like a wall, a V-groove type structure, or a pitted surface. 
In particular, we experiment on 3D textures like moss, gravel and a kitchen sponge, 
whose surfaces include canonical cast shadowing situations like V-grooves. This 
paper shows theoretically that many shadowing configurations can be mathemat- 
ically analyzed using convolutions and Fourier basis functions. Our analysis ex- 
poses the mathematical convolution structure of cast shadows, and shows strong 
connections to recently developed signal-processing frameworks for reflection 
and illumination. An analytic convolution formula is derived for a 2D V-groove, 
which is shown to correspond closely to many common shadowing situations, 
especially in 3D textures. Numerical simulation is used to extend these results 
to general 3D textures. These results also provide evidence that a common set 
of illumination basis functions may be appropriate for representing lighting vari- 
ability due to cast shadows in many 3D textures. We derive a new analytic basis 
suited for 3D textures to represent illumination on the hemisphere, with some 
advantages over commonly used Zernike polynomials and spherical harmonics. 
New experiments on analyzing the variability in appearance of real 3D textures 
with illumination motivate and validate our theoretical analysis. Empirical results 
show that illumination eigenfunctions often correspond closely to Fourier bases, 
while the eigenvalues drop off significantly slower than those for irradiance on 
a Lambertian curved surface. These new empirical results are explained in this 
paper, based on our theory. 



1 Introduction 

Cast shadows are an important feature of appearance. For instance, buildings may cause 
the sun to cast shadows on the ground, the nose can cast a shadow onto the face, and 
local concavities in rough surfaces or textures can lead to interesting shadowing effects. 
However, most current vision algorithms do not explicitly consider cast shadows. The 
primary reason is the difficulty in formally analyzing them, since cast shadows involve 
non-local interactions in concave regions. 

In general, shadowing can be very complicated, such as sunlight passing through the
leaves of a tree, and mathematical analysis seems hopeless. However, we believe many
common shadowing situations have simpler structures, some of which are illustrated
in Figure 1. From left to right: shadowing by a wall, a V-groove like structure, a plane
such as a desk above, and a pitted or curved surface. Though the figure is in 2D, similar
patterns often apply in 3D along the radial direction, with little change in the extent of
shadowing along transverse or azimuthal directions.

Fig. 1. Four common shadowing situations. We show that these all have similar structures,
amenable to treatment using convolution and Fourier analysis. The red lines indicate extremal
rays, corresponding to shadow boundaries for distant light sources.

Our theory is motivated by some surprising practical results. In particular, we focus 
on the appearance of natural 3D textures like moss, gravel and kitchen sponge, shown 
in Figures 2 and 6. These objects have fine-scale structures similar to the canonical 
configurations shown in Figure 1. Hence, they exhibit interesting illumination and
view-dependence, which is often described using a bi-directional texture function (BTF) [3]. In
this paper, we analyze lighting variability, assuming fixed view. Since these surfaces are 
nearly flat and diffuse, one might expect illumination variation to correspond to simple 
Lambertian cosine-dependence. However, cast shadows play a major role, leading to 
effects that are quantitatively described and mathematically explained here. 

We show that in many canonical cases, cast shadows have a simple convolution 
structure, amenable to Fourier analysis. This indicates a strong link between the math- 
ematical properties of visibility, and those of reflection and illumination (but ignoring 
cast shadows) for which Basri and Jacobs [1], and Ramamoorthi and Hanrahan [15, 
16], have recently derived signal-processing frameworks. In particular, they [1,15] show 
that the irradiance is a convolution of the lighting and the clamped cosine Lambertian 
reflection function. We derive an analogous result for cast shadows, as convolution of 
the lighting with a Heaviside step function. Our results also generalize Soler and Sil- 
lion’s [19] convolution result for shadows when source, blocker and receiver are all in 
parallel planes — for instance, V-grooves (b in Figure 1, as well as a and d) do not contain 
any parallel planes. Our specific technical contributions include the following: 

- We derive an analytic convolution formula for a 2D V-groove, and show that it 
applies to many canonical shadowing situations, such as those in Figure 1. 

- We analyze the illumination eigenmodes, showing how they correspond closely 
to Fourier basis functions. We also analyze the eigenvalue spectrum, discussing 
similarities and differences with convolution results for Lambertian curved surfaces 
and irradiance, and showing why the falloff is slower in the case of cast shadows. 

- We explain important lighting effects in 3D textures, documented quantitatively here 
for the first time. Experimental results confirm the theoretical analysis. 

- We introduce new illumination basis functions over the hemisphere for lighting
variability due to cast shadows in 3D textures, potentially applicable to compression,
interpolation and prediction. These bases are based on analytic results and numerical
simulation, and validated by empirical results. They have some advantages over the
commonly used spherical harmonics and Zernike polynomials.

Our paper builds on a rich history of previous work on reflection models, such as 
Oren-Nayar [12], Torrance-Sparrow [21], Wolff et al. [23] and Koenderink et al. [6], 
as well as several recent articles on the properties of 3D textures [2,20]. Our analytic 
formulae are derived considering the standard V-grooves used in many of these previous 
reflection models [12,21]. Note that many of these models include a complete analysis 
of visibility in V-grooves or similar structures, for any single light source direction. 
We differ in considering cast shadows because of complex illumination, deriving a 
convolution framework, and analyzing the eigenstructure of visibility. Our work also 
relates to recent approaches to real-time rendering, such as the precomputed transfer 
method of Sloan et al. [18], that represents appearance effects including cast shadows, 
due to low-frequency illumination, represented in spherical harmonics. However, there is 
no analytic convolution formula or insight in their work as to the optimal basis functions 
or the number of terms needed for good approximation. We seek to put future real-time 
rendering methods on a strong theoretical footing by formalizing the idea of convolution 
for cast shadows, analyzing the form of the eigenvalue spectrum, showing that the decay 
is much slower than for Lambertian irradiance, and that we therefore need many more 
basis functions to capture sharp shadows than the low order spherical harmonics and 
polynomials used by Sloan et al. [18] and Malzbender et al. [11]. 



2 The Structure of Cast Shadows 

In this section, we briefly discuss the structure of cast shadows, followed in the next 
section by a derivation of an analytic convolution formula for a 2D V-groove, Fourier 
and principal component analysis, and initial experimental observations and validation. 

First, we briefly make some theoretical observations. Consider Figures 1 a and b.
There is a single extreme point B. As we move from O to A' to A (with the extremal rays
being OB, A'B and AB), the visible region of the illumination monotonically increases.
This local shadowing situation, with a single extreme point B, and monotonic variation
of the visible region of the illumination as one moves along the surface, is one of the
main ideas in our derivation. Furthermore, multiple extreme points or blockers can often
be handled independently. For instance, in Figures 1 c and d, we have two extreme points
B and C. The net shadowing effect is essentially just the superposition of the effects of
extreme rays through B and C.

Second, we describe some new experimental results on the variability of appearance 
in 3D textures with illumination, a major component of which are cast shadowing
interactions similar to the canonical examples in Figure 1. In Figure 2, we show an initial
experiment. We illuminated a sample of gravel along an arc (angle ranged from —90° to 
+64°, limited by specifics of the acquisition). The varying appearance with illumination 
clearly suggests cast shadows are an important visual feature. The figure also shows a 
conceptual diagrammatic representation of the profile of a cross-section of the surface, 
with many points shadowed in a manner similar to Figure 1 (a), (b) and (d). 

Fig. 2. (a): Gravel texture, which exhibits strong shadowing. (b): Images with different light
directions clearly show cast shadow appearance effects, especially at large angles. The light directions
correspond to the red marks in (d). (c): Conceptual representation of a profile of a cross section
through the surface (drawn in black in a). (d): Schematic of experimental setup.



3 2D Analysis of Cast Shadows 

For mathematical analysis, we begin in flatland, i.e., a 2D slice through the viewpoint. 
We will consider a V-groove model, shown in Figure 3, corresponding to Figure 1 b. 
However, the derivation will be similar for any other shadowing situation, such as those in 
Figure 1, where the visibility is locally monotonically changing. Note that the V-groove 
model in Figure 3 can model the examples in Figures 1 a and b ($\beta_1 = 0$, $\beta_2 = \pi/2$ and
$\beta_1 = \beta_2$), and each of the extreme points of Figures 1 c and d.




Fig. 3. Diagram of V-groove with groove angle ranging from $-\beta_1$ to $+\beta_2$. While the figure shows
$\beta_1 = \beta_2$, as is common for previous V-groove models, there is no requirement of symmetry.
We will be interested in visibility for points $A(x)$ where $x$ is the distance along the groove (the
labels are as in Figure 1; the line A'B is omitted for clarity). Note that the visible region of $A(x)$,
determined by $\alpha(x)$, increases monotonically with $x$ along the groove.

3.1 Convolution Formula for Shadows in a V-Groove

Our goal is to find the irradiance$^1$ $E(x, \beta)$ as a function of groove angle $\beta = [-\beta_1, +\beta_2]$,
and the distance along the groove $x$. Without loss of generality, we consider the right
side of the groove only. The left side can be treated similarly. For a particular groove
(fixed $\beta$), pixels in a single image correspond directly to different values of $x$, and the
irradiance $E(x)$ is directly proportional to pixel brightness.

$$E(x, \beta) = \int_{-\pi/2}^{\pi/2} L(\omega)\,V(x, \omega, \beta)\,d\omega, \qquad (1)$$

where $L(\omega)$ is the incident illumination intensity, which is a function of the incident
direction $\omega$. We make no restrictions on the lighting, except that it is assumed distant,
so the angle $\omega$ does not depend on location $x$. This is a standard assumption in
environment map rendering in graphics, and has been used in previous derivations of analytic
convolution formulae [1,16]. $V$ is the binary visibility in direction $\omega$ at location $A(x)$.

Monotonic Variation of Visibility: As per the geometry in Figure 3, the visibility is 1
in the range from $-\beta_1$ to $\beta_2 + \alpha(x)$ and 0 or (cast) shadowed otherwise. It is important
to note that $\alpha(x)$ is a monotonically increasing function of $x$, i.e., the portion of the
illumination visible increases as one moves along the right side of the groove from O to
A' to A (with corresponding extremal rays OB, A'B and AB).

Reparameterization by $\alpha$: We now simply use $\alpha$ to parameterize the V-groove. This is
just a change of variables, and is valid as long as $\alpha$ monotonically varies with $x$. Locally,
$\alpha$ is always proportional to $x$, since we may do a local Taylor series expansion, keeping
only the first or linear term.

Representation of Visibility: We may now write down the function $V(x, \omega, \beta)$, newly
reparameterized as $V(\alpha, \omega, \beta)$. Noting that $V$ is 1 only in the range $[-\beta_1, \beta_2 + \alpha]$,

$$V(\alpha, \omega, \beta) = H(-\beta_1 - \omega) - H((\beta_2 + \alpha) - \omega),$$
$$H(u) = 1 \text{ if } u < 0, \quad 0 \text{ if } u > 0, \qquad (2)$$

where $H(u)$ is the Heaviside step function. The first term on the right hand side zeros the
visibility when $\omega < -\beta_1$ and the second term when $\omega > \beta_2 + \alpha$. Figure 4 illustrates this
diagrammatically. In the limit of a perfectly flat Lambertian surface, $\beta_1 = \beta_2 = \pi/2$,
and $\alpha = 0$. In that case, the first term on the right of Equation 2 is always 1, the second
term is 0, and $V = 1$ (no cast shadowing), as expected.

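For a particular groove (fixed $\beta$), $V$ is given by the following intervals.

$$\begin{array}{lll}
-\pi/2 < \omega < -\beta_1 & V = 0 & \text{independent of } \alpha \\
-\beta_1 < \omega < +\beta_2 & V = 1 & \text{independent of } \alpha \\
+\beta_2 < \omega < \beta_2 + \alpha & V = 1 & \text{interval depends on } \alpha \\
\beta_2 + \alpha < \omega < \pi/2 & V = 0 & \text{interval depends on } \alpha.
\end{array} \qquad (3)$$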
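
Equation 2 and the interval table of Equation 3 can be cross-checked with a few lines of Python (NumPy). The step function follows the convention stated in Equation 2 (1 for negative arguments); the particular groove angles and the sampling of $\omega$ are arbitrary illustrative values.

```python
import numpy as np

def H(u):
    """Step function as defined in Equation 2: 1 for u < 0 and 0 for u > 0."""
    return (np.asarray(u) < 0).astype(float)

def visibility(alpha, omega, beta1, beta2):
    """V(alpha, omega, beta) = H(-beta1 - omega) - H((beta2 + alpha) - omega)."""
    return H(-beta1 - omega) - H((beta2 + alpha) - omega)

beta1, beta2, alpha = 0.6, 0.6, 0.3
omega = np.linspace(-np.pi / 2, np.pi / 2, 1001)
V = visibility(alpha, omega, beta1, beta2)

# Interval form of Equation 3: visible exactly for -beta1 < omega < beta2 + alpha.
expected = ((omega > -beta1) & (omega < beta2 + alpha)).astype(float)

# Flat-surface limit: beta1 = beta2 = pi/2 and alpha = 0 gives V = 1 inside the hemisphere.
V_flat = visibility(0.0, np.linspace(-1.5, 1.5, 7), np.pi / 2, np.pi / 2)
```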


$^1$ Since we focus on cast shadows, we will assume Lambertian surfaces, and will neglect the
incident cosine term. This cosine term may be folded into the illumination function if desired,
as the surface normal over a particular face (side) of the V-groove is constant.



[Figure 4: three plots of $V(\alpha, \omega, \beta) = H(-\beta_1 - \omega) - H((\beta_2 + \alpha) - \omega)$ against $\omega$ from $-\pi/2$ to $+\pi/2$,
with axis marks at $-\beta_1$, $+\beta_2$ and $\beta_2 + \alpha$.]

Fig. 4. Illustration of the visibility function as per Equation 2. The black portions of the graphs
where $\omega < +\beta_2$ are independent of $\alpha$ or groove location, while the red portions with $\omega > +\beta_2$
vary linearly with $\alpha$, leading to the convolution structure.

Convolution Formula: Plugging Equation 2 back into Equation 1, we obtain

$$E(\alpha, \beta) = \int_{-\pi/2}^{\pi/2} L(\omega)\,H(-\beta_1 - \omega)\,d\omega - \int_{-\pi/2}^{\pi/2} L(\omega)\,H((\beta_2 + \alpha) - \omega)\,d\omega. \qquad (4)$$

$E$ is the sum of two terms, the first of which depends only on groove angle $\beta_1$, and the
second that also depends on groove location or image position $\alpha$. In the limit of a flat
diffuse surface, the second term vanishes, while the first corresponds to convolution with
unity, and is simply the (unshadowed) irradiance or integral of the illumination. We now
separate the two terms to simplify this result as ($\otimes$ is the convolution operator)

$$E(\alpha, \beta) = E(-\beta_1) - E(\beta_2 + \alpha),$$
$$E(u) = \int_{-\pi/2}^{\pi/2} L(\omega)\,H(u - \omega)\,d\omega = L \otimes H. \qquad (5)$$
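
The equivalence between the direct visibility integral (Equations 1 and 2) and the difference of two convolution evaluations (Equation 5) can be verified numerically. In this Python sketch the lighting function, the groove angles, and the midpoint-rule discretization are all made-up illustrative choices.

```python
import numpy as np

def H(u):
    """Step function of Equation 2: 1 for u < 0 and 0 for u > 0."""
    return (np.asarray(u) < 0).astype(float)

n = 200000
d_omega = np.pi / n
omega = -np.pi / 2 + (np.arange(n) + 0.5) * d_omega   # midpoint samples of the hemisphere
L = 1.0 + 0.5 * np.cos(3.0 * omega)                    # hypothetical distant lighting

def E_conv(u):
    """One-argument E(u): the lighting convolved with the step function."""
    return float(np.sum(L * H(u - omega)) * d_omega)

def E_direct(alpha, beta1, beta2):
    """Irradiance from Equation 1 with the visibility of Equation 2."""
    V = H(-beta1 - omega) - H((beta2 + alpha) - omega)
    return float(np.sum(L * V) * d_omega)

beta1, beta2, alpha = 0.5, 0.7, 0.4
lhs = E_direct(alpha, beta1, beta2)
rhs = E_conv(-beta1) - E_conv(beta2 + alpha)
# Closed form of the integral of L over [-beta1, beta2 + alpha] for this lighting.
exact = 1.6 + (np.sin(3.3) + np.sin(1.5)) / 6.0
```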

Fourier Analysis: Equation 5 makes clear that the net visibility or irradiance is a simple
convolution of the incident illumination with the Heaviside step function that accounts
for cast shadow effects. This is our main analytic result, deriving a new convolution
formula that sheds theoretical insight on the structure of cast shadows. It is therefore
natural to also derive a product formula in the Fourier or frequency domain,

$$E_k = \sqrt{\pi}\,L_k H_k, \qquad (6)$$

where $L_k$ are the Fourier illumination coefficients, and $H_k$ are Fourier coefficients of
the Heaviside step function, plotted in Figure 5. The even coefficients $H_{2k}$ vanish, while
the odd coefficients decay as $1/k$. The analytic formula is

k = 0: H 0 
odd k : H k 
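
The qualitative behavior of the step-function coefficients (even terms vanish, odd terms decay as $1/k$) can be reproduced numerically. The Python sketch below samples one period of a step and uses the FFT; the period of $2\pi$ and the $1/n$ normalization are assumptions, so the magnitudes may differ from the $H_k$ of Equation 6 by constant factors.

```python
import numpy as np

n = 4096
u = -np.pi + (np.arange(n) + 0.5) * (2 * np.pi / n)   # one period, midpoint samples
h = (u < 0).astype(float)                              # step: 1 for u < 0, 0 for u > 0

c = np.fft.fft(h) / n     # discrete approximation of the Fourier series coefficients
mag = np.abs(c)

even = [mag[k] for k in range(2, 17, 2)]               # expected to vanish
odd_scaled = [k * np.pi * mag[k] for k in range(1, 16, 2)]   # expected to be close to 1
```

With this normalization the odd magnitudes are approximately $1/(\pi k)$, which is the slow decay discussed in the text.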





3.2 Eigenvalue Spectrum and Illumination Eigenmodes for Cast Shadows 

Our convolution formula is conceptually quite similar to the convolution formula and
signal-processing analysis done for convex curved Lambertian surfaces or irradiance
by Basri and Jacobs [1] and Ramamoorthi and Hanrahan [14,15]. In this subsection,
we analyze our results further in terms of the illumination eigenmodes that indicate the
lighting distributions that have the most effect, and the corresponding eigenvalues or
singular values that determine the relative importance of the modes. We also compare
to similar analyses for irradiance on a curved surface [1,13,14,15].

[Figure 5: three panels titled "Heaviside step function", "Clamped Cosine (Lambertian)" and
"Loglog plot of nonzero eigenvalues"; axis label: singular value number (k).]

Fig. 5. Comparison of Fourier coefficients for the Heaviside step function for cast shadows (left)
and the clamped cosine Lambertian function for irradiance (middle). For the step function, even
terms vanish, while odd terms decay as $1/k$. For the clamped cosine, odd terms greater than 1
vanish, while even terms decay much faster as $1/k^2$. On the right is a loglog plot of the absolute
values of the nonzero eigenvalues. The graphs are straight lines with slope -1 for cast shadows,
compared to the quadratic decay (slope -2) for irradiance.

Illumination eigenmodes are usually found empirically by considering the SVD of a 
large number of images under different (directional source) illuminations, as in lighting- 
insensitive face and object recognition [4,5]. It seems intuitive in our case that the 
eigenfunctions will be sines and cosines. To formalize this analytically, we must relate 
the convolution formula above, that applies to a single image with complex illumination, 
to the eigenfunctions derived from a number of images taken assuming directional source 
lighting. Our approach is conceptually similar to Ramamoorthi’s work on analytic PCA 
construction [13] for images of a convex curved Lambertian object. 

Specifically, we analyze V(a, ω, β) for a particular groove (fixed β). Then, V(a, ω)
is a matrix with rows corresponding to groove locations (image pixels) a and columns
corresponding to illumination directions ω. A singular-value decomposition (SVD) will
give the eigenvalues (singular values) and illumination eigenmodes. It can be formally 
shown (details omitted here) that the following results hold, as expected. 

Eigenvalue Spectrum: The eigenvalues decay as 1/k, corresponding to the Heav- 
iside coefficients, as shown in Figure 5. Because of the relatively slow 1/k decay, we 
need quite high frequencies (many terms) for good approximation of cast shadows. On 
the other hand (see footnote 2), for irradiance on a convex curved surface, we convolve with the clamped
cosine function max(cos θ, 0) whose Fourier coefficients fall off quadratically as 1/k^2,
with very few terms needed for accurate representation [1,15].
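The 1/k decay can be reproduced with a small numerical experiment. The sketch below is a simplified stand-in rather than the actual V-groove geometry: it assumes each groove location is lit exactly when the light direction exceeds a threshold angle, with thresholds and directions on the same uniform grid. The function names, the grid size of 400 and the fitting range are illustrative choices.

```python
import numpy as np

def step_visibility_matrix(n=400):
    """Rows: groove locations, each with an assumed shadowing threshold angle.
    Columns: light directions. Entry is 1 where the light clears the threshold."""
    thresholds = np.linspace(-np.pi / 2, np.pi / 2, n)   # assumed uniform spacing
    directions = np.linspace(-np.pi / 2, np.pi / 2, n)
    return (directions[None, :] >= thresholds[:, None]).astype(float)

def loglog_slope(singular_values, k_min=5, k_max=50):
    """Least-squares slope of log(sigma_k) against log(k) for k_min <= k <= k_max."""
    k = np.arange(k_min, k_max + 1)
    return np.polyfit(np.log(k), np.log(singular_values[k - 1]), 1)[0]

sigma = np.linalg.svd(step_visibility_matrix(), compute_uv=False)
slope = loglog_slope(sigma)
print(f"fitted log-log slope: {slope:.2f}")
```

For this idealized step matrix the fitted slope should come out close to -1, in line with the 2D prediction.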

In actual experiments on 3D textures, the eigenvalues decay somewhat faster. First,
as explained in section 4.1, the eigenvalues for cast shadows decay as 1/k^{3/2} (loglog
slope -1.5) in 3D, as opposed to 1/k in 2D. Second, in the Lambertian case, since we are
dealing with flat, as opposed to spherical surfaces, the eigenvalues for irradiance drop
off much faster than 1/k^2. In fact, for an ideal flat diffuse surface, all of the energy is in
the first eigenmode, that corresponds simply to Lambertian cosine-dependence.

2 The Heaviside function has a position or C^0 discontinuity at the step, while the clamped cosine
has a derivative or C^1 discontinuity at cos θ = 0. It is known in Fourier analysis [10] that a C^n
discontinuity will generally result in a spectrum that falls off as 1/k^{n+1}.

Illumination Eigenmodes: The illumination eigenmodes are simply Fourier basis 
functions — sines and cosines. This is the case for irradiance on a curved surface in 2D 
as well [14], reinforcing the mathematically similar convolution structure. 

Implications: There are many potential implications of these results, to explain empir- 
ical observations and devise practical algorithms. For instance, it has been shown [15] 
that illumination estimation from a convex Lambertian surface is ill-posed since only 
the first two orders can be estimated. But Sato et al. [17] have shown that illumination 
can often be estimated from cast shadows. Our results explain why it is feasible to es- 
timate much higher frequencies of the illumination from the effects of cast shadows. 
In lighting-insensitive recognition, there has been much work on low-dimensional sub- 
spaces for Lambertian objects [1,4,5,13]. Similar techniques might be applied, simply 
using more basis functions, and including cast shadow effects, since cast shadows and 
irradiance have the same mathematical structure. Our results have direct implications 
in BTF modeling and rendering for representing illumination variability, and providing 
appropriate basis functions for compression and synthesis. 

3.3 Experimental Validation 

In this subsection, we present an initial quantitative experimental result motivating and 
validating our derivation. The next sections generalize these results to 3D, and present 
more thorough experimental validations. We used the experimental setup of Figure 2, 
determining the eigenvalue spectrum and illumination eigenmodes for both a sample of 
moss, and a flat piece of paper. The paper serves as a control experiment on a nearly 
Lambertian surface. Our results are shown in Figure 6. 

Eigenvalue Spectrum: As seen in Figure 6 (c), the eigenvalues (singular values) for 
moss when plotted on a log-log scale lie on a straight line with slope approximately 
-1.5, as expected. This contrasts with the expected result for a flat Lambertian surface, 
where we should in theory see a single eigenmode (simply the cosine term). Indeed, in 
our control experiment with a piece of paper, also shown in Figure 6 (c), 99.9% of the 
energy for the paper is in the first eigenmode, with a very fast decay after that. 

Illumination Eigenmodes: As predicted, the illumination eigenmodes are simply 
Fourier basis functions — sines and cosines. This indicates that a common set of illu- 
mination eigenfunctions may describe lighting-dependence in many 3D textures. 

4 3D Numerical Analysis of Cast Shadows 

In 3D, V-grooves can be rotated to any orientation about the vertical; hence, the direction
of the Fourier basis functions can also be rotated. For a given V-groove direction, the
2D derivation essentially still holds, since it depends on the monotonic increase in
visibility as one moves along the groove, which still holds in 3D. The interesting question
is, what is the set of illumination basis functions that encompasses all V-groove (and
correspondingly Fourier) orientations in 3D?

Fig. 6. (a): Moss 3D texture with significant shadowing. Experimental setup is as in Figure 2.
(b): 6 images of the moss with different lighting directions, as well as a control experiment of
paper (a flat near-Lambertian surface). Note the variation of appearance of moss with illumination
direction due to cast shadows, especially for large angles. In contrast, while the overall intensity
changes for the paper, there is almost no variation on the surface. (c): Decay of singular values
for illumination eigenmodes for 3D textures is a straight line with slope approximately -1.5 on a
logarithmic scale. In contrast, for a flat near-Lambertian surface, all of the energy is in the first
eigenmode with a very rapid falloff. (d): The first four illumination eigenfunctions for moss, which
are simply sines and cosines.

One might expect the basis functions to be close to spherical harmonics [9], the nat- 
ural extension of the Fourier basis to the sphere. However, we are considering only the 
visible upper hemisphere, and we will see that our basis functions take a somewhat sim- 
pler form than spherical harmonics or Zernike polynomials [7], corresponding closely to 
2D Fourier transforms. In this section, we report on the results of numerical simulations, 
shown in Figures 7, 8 and 9. We then verify these results with experiments on real 3D 
textures including moss, gravel and a kitchen sponge. 

4.1 Numerical Eigenvalue Spectrum and Illumination Eigenmodes 

For numerical simulation, we consider V-grooves oriented at (rotated by) arbitrary angles
about the vertical, ranging from 0 to 2π. For each orientation, we consider a number
of V-groove angles with β ranging from 0 to π/2. In essence, we have an ensemble of
a large number of V-grooves (1000 in our simulations). Each point on each V-groove
has a binary visibility value for each point on the illumination hemisphere. We can
assemble all this information into a large visibility matrix, where the rows correspond
to V-groove points (image pixels), and the columns to illumination directions. Then, as
in experiments with real textures like Figure 6, we do an SVD (see footnote 3) to find the illumination
eigenmodes and eigenvalues.

[Figure 7 panels: Decay in singular values, against singular value number (left); Loglog plot of singular value vs frequency k, with singular value number (k+1)^2 (right).]

Fig. 7. Left: Singular values for illumination basis functions due to cast shadows in a simulated
3D texture (randomly oriented V-grooves), plotted on a linear scale. A number of singular values
cluster together. Right: Decay of singular values [value vs frequency or square root of singular
value number] on a logarithmic scale (with natural logarithms included as axis labels). We get a
straight line with slope approximately -1.5 as expected.

Numerical Eigenvalue Spectrum: We first consider the eigenvalues or singular values, 
plotted in the left of Figure 7 on a linear scale. At first glance, this plot is rather sur- 
prising. Even though the singular values decrease with increasing frequency, a number 
of them cluster together. Actually, these results are very similar to those for irradiance 
and spherical harmonics [1,13,15], where 2k + 1 basis functions of order k are similar. 
Similarly, our eigenmodes are Fourier-like, with 2k + 1 eigenmodes at order k (with 
a total of (k + 1)^2 eigenmodes up to order k). Therefore, to determine the decay of
singular values, it is more appropriate to consider them as a function of order k. We 
show k ranging from 1 to 15 in the right of Figure 7. 

As expected, the curve is almost exactly a straight line on a log-log plot, with a slope 
of approximately -1.5. The higher slope (-1.5 compared to -1 in 2D) is a natural conse- 
quence of the properties of Fourier series of a function with a curve discontinuity [10], 
as is the case in 3D visibility. The total energy (sum of squared singular values) at each 
order k goes as 1/k^2 in both 2D and 3D cases. However, in 3D, each frequency band
contains 2k + 1 functions, so the energy in each individual basis function decays as
1/k^3, with the singular values therefore falling off as 1/k^{3/2}.
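The counting argument can be written out as a short worked derivation. It assumes, as an idealization, that the energy at order k is shared equally among the 2k + 1 basis functions of that order; E_k and σ_k are notation introduced here for the total energy and the per-mode singular value.

```latex
\begin{align*}
E_k &\propto \frac{1}{k^2}
  && \text{(total energy at order } k\text{)} \\
\sigma_k^2 &\approx \frac{E_k}{2k+1} \propto \frac{1}{k^2 (2k+1)} \sim \frac{1}{2k^3}
  && \text{(energy per basis function, large } k\text{)} \\
\sigma_k &\propto k^{-3/2}
  && \text{(singular value, log-log slope } -3/2\text{)}
\end{align*}
```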

Numerical Illumination Eigenmodes: The first nine eigenmodes are plotted in Figure 8,
where we label the eigenmodes using (m, n) with the net frequency given by
k = m + |n|, with k ≥ 0, -k ≤ n ≤ k, and m = k - |n|. This labeling anticipates
the ensuing discussion, and is also quite similar to that used for spherical harmonics.

3 Owing to the large size of the matrices both here and in our experiments with real data, SVD is
performed in a two-step procedure in practice. First, we find the basis functions and eigenvalues
for each V-groove. A second SVD is then performed on these weighted basis functions.

[Figure 8: Basis Functions on Hemisphere, arranged by net frequency k = m + |n| and azimuthal index n. The order 1 modes can be compared to those shown in Figure 11. Order 2 panels: (m, n) = (0, -2), (1, -1), (2, 0), (1, 1), (0, 2).]

Fig. 8. 3D hemispherical basis functions obtained from numerical simulations of V-grooves. Green
denotes positive values and red denotes negative values. θ and φ are a standard spherical
parameterization, with the Cartesian (x, y, z) = (sin θ cos φ, sin θ sin φ, cos θ).

To gain further insights, we attempt to factor these basis functions into a separable 
form. Most commonly used 2D or (hemi) spherical basis functions are factorizable. For 
instance, consider the 2D Fourier transform. In this case, 

W_{mn}(x, y) = U_m(x) V_n(y), (8)

where W is the (complex) 2D basis function exp(imx) exp(iny), and U_m and V_n are
1D Fourier functions (exp(imx) and exp(iny) respectively). Spherical harmonics and
Zernike polynomials are also factorizable, but doing so is somewhat more complicated,

W_{mn}(\theta, \phi) = U_m^n(\theta) V_n(\phi), (9)

where V_n is still a Fourier basis function exp(inφ) [this is because of azimuthal symmetry
in the problem and will be true in our case too], and U_m^n are associated Legendre
polynomials for spherical harmonics, or Zernike polynomials. Note that U_m^n now has
two indices, unlike the simpler Fourier case, and also depends on azimuthal index n.

We now factor our eigenmodes. The first few eigenfunctions are almost completely
factorizable, and representable in a form similar to Equation 8, i.e., like a 2D Fourier
transform, and simpler (see footnote 4) than spherical harmonics or Zernike polynomials,

W_{mn}(\theta, \phi) = U_m(\theta) V_n(\phi). (10)

Figure 9 shows factorization into 1D functions U_m(θ) and V_n(φ). It is observed that
the U_m correspond closely to odd Legendre polynomials P_{2m+1}. This is not surprising
since Legendre polynomials are spherical frequency-space basis functions. We observe

4 Mathematically, functions of the form of Equation 10 can have a discontinuity at the pole θ = 0.
However, in our numerical simulations and experimental tests, we have found that this form
closely approximates observed results, and does not appear to create practical difficulties.





A Fourier Theory for Cast Shadows 






[Figure 9 panels: Basis Functions U and V; Mean Term.]

Fig. 9. The functions in Figure 8 are simple products of 1D basis functions along elevation θ and
azimuthal φ directions, as per Equation 10. Note that the V_{±n} are sines and cosines while the
U_m are approximately Legendre polynomials (P_3 for m = 1, P_5 for m = 2). Figure 12 shows
corresponding experimental results on an actual 3D texture.



only odd terms 2m + 1, since they correctly vanish at θ = π/2, when a point is always
shadowed. V_n are simply Fourier azimuthal functions or sines and cosines. The net
frequency is k = m + |n|, with 2k + 1 basis functions at order k.
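The labeling and mode counts can be checked with a few lines of code. This is a minimal sketch; the function names `modes_at_order` and `modes_up_to` are introduced here for illustration only.

```python
def modes_at_order(k):
    """All (m, n) labels with net frequency k = m + |n|, for n = -k, ..., k."""
    return [(k - abs(n), n) for n in range(-k, k + 1)]

def modes_up_to(order):
    """All (m, n) labels for net frequencies 0, 1, ..., order."""
    return [label for k in range(order + 1) for label in modes_at_order(k)]

for k in range(3):
    print(k, modes_at_order(k))
```

Each order k contributes 2k + 1 labels, so orders 0 through 2 give the nine eigenmodes of Figure 8.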



4.2 Results of Experiments with Real 3D Textures 

In this subsection, we report on empirical results in 3D, showing that the experimental 
observations are consistent with, and therefore validate, the theoretical and numerical 
analysis. We considered three different 3D textures — the moss and gravel, shown in 
Figure 6, and a kitchen sponge, shown in Figure 14. We report in this section primarily 
on results for the sponge; results for the other samples are similar. 

For each texture, we took a number of images with a fixed overhead camera view, 
and varying illumination direction. The setup in Figure 2 shows a 2D slice of illumi- 
nation directions. For the experiments in this section, the lighting ranged over the full 
3D hemisphere. That is, θ ranged over [14°, 88°] in 2 degree increments (38 different
elevation angles) and φ over [-180°, 178°], also in 2 degree increments (180 different
azimuthal angles). The acquisition setup restricted imaging near the pole. Hence we
captured 6840 images (38 x 180) for each texture. This is more than an order of magnitude
denser (roughly 33 times) than the 205 images acquired by Dana et al. [3] to represent both light
and view variation, and provides a good testbed for comparison with simulations.
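The sampling counts follow directly from the stated angular ranges, as the short check below shows. The variable names are illustrative; angles are in degrees.

```python
elevations = list(range(14, 89, 2))      # theta: 14, 16, ..., 88
azimuths = list(range(-180, 179, 2))     # phi: -180, -178, ..., 178
light_directions = [(t, p) for t in elevations for p in azimuths]

print(len(elevations), len(azimuths), len(light_directions))
# Ratio to the 205 images of Dana et al. [3]
print(round(len(light_directions) / 205, 1))
```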

For numerical work, we then assembled all of this information in a large matrix, 
the rows of which were image pixels, and the columns of which were light source 
directions. Just as in our numerical simulations, we then used SVD to find the illumination
eigenmodes and eigenvalues. We validate the numerical simulations by comparing the 
experimental results for real data to the expected (i.e., numerical) results just described. 

Fig. 10. Left: Plot of singular values for the sponge on a linear scale. Right: Singular values vs
frequencies on a logarithmic scale, with natural log axis labels. These experimental results should 
be compared to the predicted results from numerical simulation in Figure 7. 



Experimental eigenvalue spectrum: Figure 10 plots the experimentally observed 
falloff of eigenvalues. We see on the left that eigenmodes 2-4 (the first three after the 
mean term) cluster together, as predicted by our numerical simulations. One can see a 
rather subtle effect of clustering in second order eigenmodes as well, but beyond that, 
the degeneracy is broken. This is not surprising for real data, and consistent with similar 
results for PCA analysis in Lambertian shading [13]. 

Computing the slope for singular value dropoff is difficult because of insufficiency 
of accurate data (the first 20 or so eigenmodes correspond only to the first 5 orders, 
and noise is substantial for higher order eigenmodes). For low orders (corresponding to 
eigenmodes 2-16, or orders 1-3), the slope on a loglog plot is approximately -1.6, as 
shown in the right of Figure 10, in agreement with the expected result of -1.5. 

Experimental illumination eigenmodes: We next analyze the forms of the eigenmodes; 
the order 1 modes for moss, gravel and sponge are shown in Figure 11. The first order
eigenmodes observed are linear combinations of the actual separable functions — this is 
expected, and just corresponds to a rotation. 



[Figure 11: Order 1 (k = 1) modes for moss, gravel and sponge. Linear combinations listed below each eigenmode:]

Moss:    0.99 (0,-1) + 0.13 (1,0)    0.98 (1,0) + 0.16 (0,1)    0.99 (0,1) + 0.10 (1,0)
Gravel:  0.99 (0,-1) + 0.13 (1,0)    0.97 (1,0) + 0.24 (0,1)    0.98 (0,1) + 0.19 (1,0)
Sponge:  0.89 (0,-1) + 0.45 (1,0)    0.84 (1,0) + 0.53 (0,1)    0.96 (0,1) + 0.28 (1,0)

Fig. 11. Order 1 eigenmodes experimentally observed for moss, gravel and sponge. Note the
similarity between the 3 textures, and to the basis functions in Figure 8. The numbers below
represent each eigenmode as a linear combination of separable basis functions.

Fig. 12. Factored basis functions U_m(θ) and V_n(φ) for sponge. The top row shows the mean
eigenmode, and the functions U_0(θ) and U_1(θ). Below that are the nearly constant V_0(φ) and
the sinusoidal V_1(φ), V_{-1}(φ). The colors red, blue and green respectively are used to refer to the
three order 1 eigenmodes that are factored to obtain U_m and V_n. We use black to denote the mean
value across the three eigenmodes. It is seen that all the eigenmodes have very similar curves,
which also match the results in Figure 9.

We next found the separable functions U_m and V_n along θ and φ by using an SVD
of the 2D eigenmodes. As expected, the U and V basis functions found separately from 
the three order 1 eigenmodes were largely similar, and matched those obtained from 
numerical simulation. Our plots in Figure 12 show both the average basis functions (in 
black), and the individual functions from the three eigenmodes (in red, blue and green) 
for the sponge dataset. We see that these have the expected forms, and the eigenmodes 
are well described as a linear combination of separable basis functions. 
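The factoring step can be sketched as a rank-1 SVD approximation of an eigenmode sampled on a (θ, φ) grid. The code below is an illustration under stated assumptions, not the procedure used for Figure 12: the eigenmode is synthetic (a separable product plus small noise), and `factor_separable` is a hypothetical helper name.

```python
import numpy as np

def factor_separable(mode):
    """Best rank-1 factorization mode[i, j] ~ u[i] * v[j] via SVD.
    Returns u, v and the fraction of squared energy captured by the rank-1 term."""
    U, s, Vt = np.linalg.svd(mode, full_matrices=False)
    u = U[:, 0] * np.sqrt(s[0])
    v = Vt[0, :] * np.sqrt(s[0])
    return u, v, s[0] ** 2 / np.sum(s ** 2)

# Synthetic stand-in for a measured eigenmode: separable signal plus a little noise.
theta = np.radians(np.arange(14, 89, 2))
phi = np.radians(np.arange(-180, 179, 2))
rng = np.random.default_rng(0)
mode = np.outer(np.cos(theta), np.cos(phi))
mode = mode + 0.01 * rng.standard_normal(mode.shape)

u, v, energy = factor_separable(mode)
print(f"energy captured by the separable term: {energy:.4f}")
```

An energy fraction close to 1 indicates that the mode is well described by a single product U(θ)V(φ); the recovered factors are determined only up to a common sign and scale.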

5 Representation of Cast Shadow Effects in 3D Textures 

The previous sections have shown how to formally analyze cast shadow effects, numer- 
ically simulated illumination basis functions, and experimentally validated the results. 
In this section, we make a first attempt at using this knowledge to efficiently represent 
lighting variability due to cast shadows in 3D textures. 

In particular, our results indicate that a common set of illumination basis functions 
may be appropriate for many natural 3D textures. We will use an analytic basis motivated 
by the form of the illumination eigenmodes observed in the previous section. We use 
Equation 10, with the normalized basis functions written as 



W_{mn}(\theta, \phi) = \sqrt{\frac{4m+3}{\pi}} \, P_{2m+1}(\cos\theta) \, az_n(\phi), (11)

where az_n(φ) stands for cos nφ or sin nφ, depending on whether n is positive or negative (and
is \sqrt{1/2} for n = 0), while P_{2m+1} are odd Legendre polynomials.

This basis has some advantages over other possibilities such as spherical harmonics 
or Zernike polynomials for representing illumination over the hemisphere in 3D textures. 

- The basis is specialized to the hemisphere, unlike spherical harmonics.

- Its form, as per Equations 10 and 11, is a simple product of 1D functions in θ and φ,
simpler than Equation 9 for Zernike polynomials and spherical harmonics.

- For diffuse textures, due to visibility and shading effects, the intensity goes to 0
at grazing angles. These boundary conditions are automatically satisfied, since odd
Legendre polynomials vanish at θ = π/2 or cos θ = 0.

- Our basis seems consistent with numerical simulations and real experiments.
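The properties listed above can be checked numerically. The sketch below assumes the normalization sqrt((4m+3)/π) for the product P_{2m+1}(cos θ) az_n(φ), with az_0 = sqrt(1/2); this is the constant that makes the functions orthonormal over the hemisphere with measure sin θ dθ dφ. The function names and quadrature sizes are illustrative choices.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre, leggauss

def az(n, phi):
    """Azimuthal factor: cos(n phi) for n > 0, sin(|n| phi) for n < 0, sqrt(1/2) for n = 0."""
    if n > 0:
        return np.cos(n * phi)
    if n < 0:
        return np.sin(-n * phi)
    return np.full_like(phi, np.sqrt(0.5))

def W(m, n, theta, phi):
    """Assumed normalized basis: sqrt((4m+3)/pi) * P_{2m+1}(cos theta) * az_n(phi)."""
    return np.sqrt((4 * m + 3) / np.pi) * Legendre.basis(2 * m + 1)(np.cos(theta)) * az(n, phi)

def inner(m1, n1, m2, n2, n_mu=32, n_phi=64):
    """Inner product over the upper hemisphere, using mu = cos(theta) in [0, 1]."""
    x, w = leggauss(n_mu)                      # Gauss-Legendre nodes on [-1, 1]
    mu, w_mu = 0.5 * (x + 1.0), 0.5 * w        # mapped to [0, 1]
    phi = 2.0 * np.pi * np.arange(n_phi) / n_phi
    T, P = np.meshgrid(np.arccos(mu), phi, indexing="ij")
    f = W(m1, n1, T, P) * W(m2, n2, T, P)
    return np.sum(f * w_mu[:, None]) * (2.0 * np.pi / n_phi)

print(inner(0, 0, 0, 0), inner(1, 1, 1, 1), inner(0, 1, 1, 1))
```

Diagonal inner products should be close to 1 and off-diagonal ones close to 0; evaluating `W` at θ = π/2 gives values at rounding-error level, reflecting the grazing-angle boundary condition.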

Figure 13 compares the resulting error with our basis to that for spherical harmonics,
Zernike polynomials and the numerically computed optimal SVD basis for the sponge 
example. Note that the SVD basis performs best because it is tailored to the particular 
dataset and is by definition optimal. However, it requires prior knowledge of the data on 
a specific 3D texture, while we seek an analytic basis suitable for all 3D textures. These 
results demonstrate that our basis is competitive with other possibilities and can provide 
a good compact representation of measured illumination data in textures. 

[Figure 13 legend: Zernike Polynomials; Spherical Harmonics; Our Analytic Basis; Numerically computed SVD.]

Fig. 13. Comparison of errors from different bases on sponge example (the left shows the first 6
terms, while the right shows larger numbers of terms). The SVD basis is tailored to this particular
dataset and hence performs best; however it requires full prior knowledge.

Fig. 14. On the left is one of the actual sponge images. In the middle is a reconstruction using
100 analytic basis functions, achieving a compression of 70:1. On the right, we reconstruct from a
sparse set of 390 images, using our basis for interpolation and prediction. Note the subtle features
of appearance, like accurate reconstruction of shadows, that are preserved.

We demonstrate two simple applications of our analytic basis in Figure 14. In both
cases, we use our basis to fit a function over the hemisphere. For 3D textures, this
function is the illumination-dependence, fit separately at each pixel. The first application
is to compression, wherein the original 6840 images are represented using 100 basis
function coefficients at each pixel. A compression of 70:1 is thus achieved, with only
marginal loss in sharpness. This compression method was presented in [8], where a
numerically computed SVD basis for textures sampled in both lighting and viewpoint
was used. Using a standard analytic basis is simpler, and the same basis can now be used
for all 3D textures. Further, note that once continuous basis functions have been fit, they
can be evaluated for intermediate light source directions not in the original dataset. Our
second application is to interpolation from a sparse sampling of 390 images. As shown
in Figure 14, we are able to accurately reconstruct images not in the sparse dataset,
potentially allowing for much faster acquisition times (an efficiency gain of 20:1 in
this case), without sacrificing the resolution or quality of the final dataset.
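The per-pixel fitting behind both applications can be sketched as a linear least-squares problem. The code below is a hedged illustration: the helper names `fit_coefficients` and `reconstruct` are hypothetical, and a random matrix stands in for the analytic basis sampled at the light directions, since only the fitting mechanics are being shown.

```python
import numpy as np

def fit_coefficients(basis, images):
    """basis: (n_lights, n_basis) basis functions sampled at the light directions.
    images: (n_lights, n_pixels) intensities. Returns (n_basis, n_pixels) coefficients."""
    coeffs, *_ = np.linalg.lstsq(basis, images, rcond=None)
    return coeffs

def reconstruct(basis_new, coeffs):
    """Evaluate the fitted functions at new light directions (rows of basis_new)."""
    return basis_new @ coeffs

rng = np.random.default_rng(1)
n_lights, n_basis, n_pixels = 390, 100, 16
basis = rng.standard_normal((n_lights, n_basis))    # stand-in for sampled basis functions
true_coeffs = rng.standard_normal((n_basis, n_pixels))
images = basis @ true_coeffs                        # synthetic noise-free observations
coeffs = fit_coefficients(basis, images)

# Ratios quoted in the text, computed from the stated counts.
print(6840 / n_basis, 6840 / n_lights)
```

With 6840 images and 100 coefficients per pixel the storage ratio is 68.4, which the text rounds to 70:1; with 390 input images the acquisition ratio is about 17.5, rounded to 20:1.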

It is important to discuss some limitations of our experiments and the basis proposed
in Equation 11. First, this is an initial experiment, and a full quantitative conclusion
would require more validation on a variety of materials. Second, the basis functions
in Equation 11 are a good approximation to the eigenmodes derived from our numerical
simulations, but the optimal basis will likely be somewhat different for specific
shadowing configurations or 3D textures. Also, our basis functions are specialized to
hemispherical illumination for macroscopically flat textures; the spherical harmonics
or Zernike polynomials may be preferred in other applications. Another point concerns
the use of our basis for representing general hemispherical functions. In particular, our
basis functions go to 0 at θ = π/2, which is appropriate for 3D textures, and similar to
some spherical harmonic constructions over the hemisphere [22]. However, this makes
it unsuitable for other applications, where we want a general hemispherical basis.



6 Conclusions 

This paper formally analyzes cast shadows, showing that a simple Fourier signal- 
processing framework can be derived in many common cases. Our results indicate a 
theoretical link between cast shadows, and convolution formulae for irradiance and 
more general non-Lambertian materials [1,15,16]. This paper is also a first step in quan- 
titatively understanding the effects of lighting in 3D textures, where cast shadows play 
a major role. In that context, we have derived new illumination basis functions over the 
hemisphere, which are simply a separable basis written as a product of odd Legendre 
polynomials and Fourier azimuthal functions. 

Abstract. We present a novel approach to surface reconstruction from
multiple images. The central idea is to explore the integration of both 
3D stereo data and 2D calibrated images. This is motivated by the fact 
that only robust and accurate feature points that survived the geome- 
try scrutiny of multiple images are reconstructed in space. The density 
insufficiency and the inevitable holes in the stereo data should be filled 
in by using information from multiple images. The idea is therefore to 
first construct small surface patches from stereo points, then to progres- 
sively propagate only reliable patches in their neighborhood from images 
into the whole surface using a best-first strategy. The problem reduces 
to searching for an optimal local surface patch going through a given set 
of stereo points from images. This constrained optimization for a sur- 
face patch could be handled by a local graph-cut that we develop. Real 
experiments demonstrate the usability and accuracy of the approach. 

1 Introduction 

Surface reconstruction from multiple images is one of the most challenging and 
fundamental problems of computer vision. Although effective for computing cam- 
era geometry, most recent approaches [9,6] reconstruct only a 3D point cloud of 
the scene, whereas surface representations are indispensable for modelling and 
visualization applications. Surface reconstruction is a natural extension of the 
point-based geometric methods. Unfortunately, using 3D data from such a pas- 
sive system in the same way as range scanner data is often insufficient for a direct 
surface reconstruction method, because the 3D points are sparse, irregularly dis- 
tributed and missing in large areas. On the other hand, most image-based surface 
reconstruction approaches equally consider all surface points. They ignore the 
feature points although these points can be precisely matched between multiple 
images and therefore lead to accurate 3D locations. These shortcomings motivate 
us to develop a new approach to constructing a surface from stereo data [17], 
while using extra image information that is still available from a passive system. 

* ARTIS is a research project in the GRAVIR/IMAG laboratory, a joint unit of CNRS,
INPG, INRIA and UJF.

Surface reconstruction from 3D data. Surface reconstruction from scanned data
is a traditional research topic. Szeliski et al. [27] use a particle-based model of 
deformable surfaces; Hoppe et al. [11] present a signed distance for implicit sur- 
faces; Curless and Levoy [4] describe a volumetric method; and Amenta et al. [1] 
develop a set of computational geometry tools. Recently, Zhao et al. [32] develop 
a level-set method based on a variational method of minimizing a weighted 
minimal surface. Similar work is also developed by Whitaker [30] using a MAP 
framework. Surface reconstruction from depth data obtained from stereo systems 
is more challenging because the stereo data are usually much sparser and less 
regular. Fua [8] uses a system of particles to fit the stereo data. Kanade et al. [20] 
propose a deformable mesh representation to match the multiple dense stereo 
data. These methods that perform reconstruction by deforming an initial model 
or tracking discretized particles to fit the points are both topologically and nu- 
merically limited. Compared with our method, Hoff and Ahuja [10] and Szeliski 
and Golland [28] handle the data in an unordered way. Tang and Medioni [29] 
and Lee et al. [16] formulate the problem under a tensor voting framework. 

Surface reconstruction from 2D images. Recently, several volumetric algorithms 
[25, 15, 5, 14] simultaneously reconstruct surfaces and obtain dense correspon- 
dences. The method of space carving or voxel coloring [25, 15] works directly 
on discretized 3D space, voxels, based on their image consistency and visibility. 
These methods are purely local and therefore rely either on numerous viewpoints 
or on textured surfaces to achieve satisfying results. Kolmogorov and Zabih [14], 
Roy [23], and Ishikawa and Geiger [12] propose direct discrete minimization
formulations that are solved by graph-cuts. These approaches achieve disparity 
maps with accurate contours but limited depth precision. The latest improve- 
ment by Boykov and Kolmogorov [3] overcomes this point but is restricted to 
data segmentation. Paris et al. [22] also propose a continuous functional, but it is
restricted to open surfaces. Faugeras and Keriven [5] propose a method implemented
by level-sets, which is intrinsically multi-view and naturally handles
the topology and occlusion problems. However, it is not clear under what con- 
ditions their methods converge as the proposed functional seems non-convex. 

Some ideas in our approach are inspired by these methods, but fundamentally
these methods operate solely on either 3D data or 2D data. Our approach is
more similar to the work of Lhuillier and Quan [18], which integrates 3D and 
2D data under a level-set framework and suffers from the same limitations. Our 
strategy to achieve this goal is to perform a propagation in 3D space starting 
from reliable feature points. The propagation is driven by image information 
to overcome the insufficiency of 3D data. Among possible types of image infor- 
mation, cross-correlation considers the local texture information and gives the 
surface location with great precision - especially its zero-mean normalized version
(ZNCC), which is robust to lighting variations - but erroneous matches occur when
coherence constraints are not taken into consideration. We define a propagation 
technique as a local optimization to combine coherence constraints and image 
information. We make the surface grow patch by patch and we show that with 
an appropriate graph-cut technique, each patch is optimal under our hypotheses. 
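As a minimal sketch of the image measure mentioned above, the function below computes ZNCC between two intensity windows given as flat lists. The name `zncc`, the list representation and the handling of flat windows are assumptions made for illustration.

```python
import math

def zncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length intensity windows."""
    if len(a) != len(b) or not a:
        raise ValueError("windows must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat window: correlation undefined, treated here as no match
    return sum(x * y for x, y in zip(da, db)) / denom

window = [10, 12, 9, 30, 25, 14]
brighter = [2 * x + 5 for x in window]   # same texture under a gain and offset change
print(zncc(window, brighter))
```

Because the mean is subtracted and the result is normalized, a gain and offset change in one window leaves the score essentially unchanged, which is the robustness to lighting variations referred to in the text.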

Fig. 1. Our framework is centered on the propagation loop that progressively extends
the surface. Each iteration of the loop picks a seed, builds an optimal patch surrounding 
it, updates the surface and selects new seeds for further propagation. This process is 
initialized with 3D points computed by a stereoscopic method. 
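The loop of Figure 1 can be sketched with a priority queue. This is a toy illustration of a best-first strategy only: cells of a small grid stand in for surface patches, and `propagate`, `toy_quality` and `grid_neighbors` are hypothetical names; the actual patch construction by local graph-cut is replaced by simply marking a cell as covered.

```python
import heapq

def propagate(seeds, quality, neighbors):
    """Best-first propagation sketch.
    seeds: starting cells; quality(cell) -> score (higher is more reliable);
    neighbors(cell) -> candidate cells. Returns cells in the order they are covered."""
    heap = [(-quality(s), s) for s in seeds]
    heapq.heapify(heap)
    covered, order = set(), []
    while heap:
        _, cell = heapq.heappop(heap)        # pick the most reliable pending seed
        if cell in covered:
            continue
        covered.add(cell)                    # stand-in for building a patch and updating the surface
        order.append(cell)
        for nb in neighbors(cell):           # select new seeds around the new patch
            if nb not in covered:
                heapq.heappush(heap, (-quality(nb), nb))
    return order

def grid_neighbors(cell, size=4):
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < size and 0 <= cc < size:
            yield (rr, cc)

def toy_quality(cell):
    r, c = cell
    return 1.0 / (1 + r + 2 * c)             # arbitrary stand-in for a reliability score

order = propagate([(0, 0), (3, 3)], toy_quality, grid_neighbors)
print(order)
```

Negating the score turns Python's min-heap into a max-priority queue, so the most reliable candidate is always expanded first.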



Our contributions. Our approach explores the integration of both stereo data 
and calibrated images. It fully exploits the feature points to start the reconstruc- 
tion process to avoid potential ambiguities and build a precise surface. This is 
achieved through a new and innovative propagation framework controlled by 2D 
images. The graph-cut optimization is used locally to guarantee that the prop- 
agation is optimal under our hypotheses. Theoretical justifications are given for 
a formal understanding of the process and technical issues are exposed. 



2 Problem Statement and Formulation 

Given a set of calibrated images {C_j} and a set of stereo points Q = {q_i} derived
from the given images, the goal is to reconstruct a surface S of the objects in the 
scene. The problem is different to surface reconstructions from multiple images 
as addressed in [5, 25, 15] in which only 2D images are used without any 3D 
information. It is also different to surface reconstructions from scanned 3D data 
without 2D image information [11,27,4,8,20,32], as the 3D stereo data are often 
insufficient in density and accuracy for a traditional surface reconstruction. 

Furthermore, most existing surface reconstruction techniques characterize
their results by a global functional. However, Blake and Zisserman [2]
demonstrate that their optimization scheme follows local rules. For instance, it 
is shown that a discontinuity has a local influence on the result and that this 
influence disappears beyond a given distance. Examining the global approach 
of Faugeras and Keriven [5] leads to the same remark: even if the initial for- 
mulation relies on a surface integral, the practical resolution is performed with 
local differential operators. This motivates our approach (Figure 1): the surface 
is built patch by patch, each one being locally optimal for a global criterion. 

Since the patch is local, it is reasonable to establish a local coordinate
system (x, y, z), where xy is the tangent plane to the object surface and z is the
normal, and to parameterize the patch as a single-valued height field z = h(x, y).
Then, we look for a surface patch $\{(x, y, h(x, y)) \mid (x, y) \in \mathcal{D}\}$ by minimizing a
functional of type:

\[ \iint_{\mathcal{D}} c(x, y, h(x, y)) \, dx \, dy, \tag{1} \]



where c(x, y, h(x, y)) is a cost function accounting for the consistency of the
surface point (x, y, h(x, y)) in multiple images, for instance, either photo-consistency
or cross-correlation.

We also require that the optimized surface patch goes through the existing 
3D points if they do exist in the specified neighborhood. We are therefore looking 
for an interpolating surface through the given 3D points. That is, to minimize 




\[ \iint_{\mathcal{D}} c(x, y, h(x, y)) \, dx \, dy \quad \text{with} \quad h(x_i, y_i) = z_i, \tag{2} \]



where $(x_i, y_i, z_i)$ are the local coordinates of a given 3D point $q_i$. This functional
could be regularized as a minimal surface $\iint c(x, y, h(x, y)) \, ds$, leading to a
level-set implementation [26,5,13]. We do not follow this path, as it very often
results in an over-smoothed surface [18] due to the high-order derivatives involved in
the dynamic surface evolution. Instead, we apply a recent approach that reaches
sharper results [22].

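We use first derivatives as the smoothing term

\[ s(x, y, h(x, y)) = \alpha \left( \left| \frac{\partial h}{\partial x}(x, y) \right| + \left| \frac{\partial h}{\partial y}(x, y) \right| \right) \tag{3} \]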



with $\alpha$ controlling the importance of $s(\cdot)$, to minimize the following functional

\[ \iint_{\mathcal{D}} \big( c(x, y, h(x, y)) + s(x, y, h(x, y)) \big) \, dx \, dy \quad \text{with} \quad h(x_i, y_i) = z_i. \tag{4} \]



The advantage of this formulation is that this continuous functional can be
discretized, then optimized by a graph-cut algorithm. This is an adaptation of the
graph-cut approach, with a simplified smoothing term $s(\cdot)$, to a local environment
that we call a local graph-cut. The choice of $c(\cdot)$ is independent of the
general framework. Recent work mainly focuses on photo-consistency and
cross-correlation. In this paper we present an implementation of the propagation
framework with $c(\cdot) = g(\mathrm{zncc}(\cdot))$, with $g(\cdot)$ being a decreasing function such as
$x \mapsto (1 - x)$, to fit the minimization goal.
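
The cost described above can be sketched in a few lines. This is a minimal illustration, assuming two image patches have already been sampled around the projections of a candidate surface point; the function names `zncc` and `cost` and the small `eps` guard are assumptions made for this sketch, not part of the original implementation.

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = np.asarray(patch_a, dtype=float).ravel()
    b = np.asarray(patch_b, dtype=float).ravel()
    a = a - a.mean()  # removing the mean cancels additive lighting offsets
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < eps:   # textureless patch: correlation is undefined, return 0
        return 0.0
    return float(np.dot(a, b) / denom)

def cost(patch_a, patch_b):
    """Cost c = g(zncc) with the decreasing function g(x) = 1 - x."""
    return 1.0 - zncc(patch_a, patch_b)
```

Because the score is normalized, a patch and an affinely relit copy of it (gain and offset) give a ZNCC close to 1 and therefore a cost close to 0.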

This local formulation relies on the surface orientation to choose the
parameterization of the local patch, so the normal direction at a given 3D point has to
be estimated from its neighborhood. We define a seed point as a couple $(p_i, n_i)$
where $p_i$ is the 3D position and $n_i$ the surface normal direction at $p_i$. This is
dependent on the reconstruction algorithms used, which is further discussed in
Section 4. The above analysis finally leads to a surface that optimally
interpolates the given set of 3D stereo points by 2D image information defined in the
functional (4).


3 Algorithm Description

We propose here an algorithm to apply the previously discussed framework. 
This algorithm follows the general organization presented in Figure 1. First, 
using the technique discussed in Section 4, 3D stereo points are computed. The 
local surface orientation at these points is estimated and attached to the points to
form the initial seeds. Then the information is propagated from the most reliable
data to grow the surface. 



3.1 Initialization from 3D Stereo Points 

To start the algorithm, we initialize the list of seeds, from which the surface is 
propagated gradually. A high number of 3D points are robustly computed from 
the given set of images by stereo and bundle-adjustment methods (Section 4). 
Figure 4 shows these points for different data sets. Compared with standard 
sparse points, these points are highly redundant and well distributed in several 
parts. This redundancy makes it possible to evaluate the surface orientation. It is 
important to stress here that these input points are regarded as a “black box”
without assuming any property. Other preprocessing methods are also possible 
as long as the surface orientation can be estimated at each given position. 

For each 3D point $p_i$, the surface orientation $n_i$ is provided by the symmetric
3 x 3 positive semi-definite matrix $\sum_{y \in B_r(p_i) \cap Q} (y - p_i) \otimes (y - p_i)$, where
$B_r(p_i)$ denotes the ball of radius r centered at $p_i$. Among the eigenvectors $v_1, v_2, v_3$
respectively associated with the eigenvalues $\lambda_1 \geq \lambda_2 \geq \lambda_3$, we choose $n_i$ to be either $v_3$
or $-v_3$. The sign depends on the cameras used to reconstruct $p_i$. The confidence
is indicated by the ratio between the lowest eigenvalue and the larger ones.

$B_r(p_i) \cap Q$ may contain very few points, which makes the orientation estimation
unreliable. In dense regions, a large radius results in an over-smoothed estimation,
whereas a small radius makes the estimation sensitive to noise. Therefore r is
defined as a function of $p_i$: in dense regions, r is fixed to a reference value $r_{dense}$
representing the minimum scale. In sparse regions, the radius is increased
so that $B_r(p_i)$ contains at least k 3D stereo points. From many experiments, a
good compromise is to define $r_{dense}$ to be the radius of local patches and k to be
15-20.
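
The orientation estimate of this section can be sketched as follows. This is an illustrative sketch only: the function name `estimate_normal`, the optional `camera_center` argument used to pick the sign, and the specific confidence formula (one minus the ratio of the two smallest eigenvalues) are assumptions, since the text only states that the sign depends on the reconstructing cameras and that the confidence is an eigenvalue ratio.

```python
import numpy as np

def estimate_normal(p, points, r_dense, k=15, camera_center=None):
    """Return (unit normal, confidence) at p from neighboring stereo points."""
    p = np.asarray(p, dtype=float)
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts - p, axis=1)

    # Adaptive radius: r_dense in dense regions, otherwise large enough
    # to reach the k-th nearest stereo point.
    k_eff = min(k, len(pts))
    r = max(r_dense, np.sort(dist)[k_eff - 1])

    # Sum of outer products (y - p)(y - p)^T over the ball B_r(p).
    diffs = pts[dist <= r] - p
    scatter = diffs.T @ diffs

    # eigh returns eigenvalues in ascending order, so column 0 is v_3.
    eigvals, eigvecs = np.linalg.eigh(scatter)
    normal = eigvecs[:, 0]

    # Assumed sign rule: orient the normal toward the reconstructing camera.
    if camera_center is not None:
        if np.dot(normal, np.asarray(camera_center, dtype=float) - p) < 0:
            normal = -normal

    # Assumed confidence: 1 - lambda_3 / lambda_2 (close to 1 for planar data).
    confidence = 1.0 - eigvals[0] / eigvals[1] if eigvals[1] > 0 else 0.0
    return normal, float(confidence)
```

For points sampled on a plane, the smallest eigenvalue vanishes, so the returned normal is the plane normal and the assumed confidence is close to 1.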



3.2 Propagation Loop 

Starting from the seed list initialized with the 3D stereo points, this step makes 
the surface grow until it reaches the final result. Each seed is handled one by 
one to generate a surface patch and create new seeds by the local graph-cut 
technique. It is divided into the 3 following issues: 

1. The selection of the next seed from the current seed list. 

2. The generation of an optimal patch from this seed. 

3. The creation of the new seeds on this patch to add in the list. 
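
The three steps above can be sketched as a best-first loop over a priority queue. This is a schematic sketch, not the authors' implementation: `grow_patch` and `spawn_seeds` are hypothetical callbacks standing in for the patch generation and seed creation steps, and the two-tier priority simply encodes the rule that initial stereo seeds are processed before generated ones.

```python
import heapq
import itertools

def propagate(initial_seeds, grow_patch, spawn_seeds):
    """Best-first propagation loop.

    initial_seeds: iterable of (score, seed) pairs from the stereo points.
    grow_patch(seed): returns a patch, or None if the patch is rejected.
    spawn_seeds(patch): returns an iterable of (score, seed) pairs.
    """
    tie = itertools.count()  # tie-breaker so seeds are never compared directly
    heap = []
    for score, seed in initial_seeds:
        # Tier 0: stereo seeds are always selected before generated seeds.
        heapq.heappush(heap, (0, -score, next(tie), seed))

    surface = []
    while heap:  # the algorithm ends when no seed is left
        _, _, _, seed = heapq.heappop(heap)       # 1. select the best seed
        patch = grow_patch(seed)                  # 2. build a patch around it
        if patch is None:
            continue
        surface.append(patch)
        for score, new_seed in spawn_seeds(patch):  # 3. create new seeds
            heapq.heappush(heap, (1, -score, next(tie), new_seed))
    return surface
```

Storing the negated score turns Python's min-heap into a max-priority queue, so the most reliable seed within each tier is always popped first.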

3.3 Selection of the Next Seed

To select a new seed (p, n) for propagation, we need a criterion $\Pi$ to evaluate
how “good for propagation” a seed is. Once we have such a criterion, we follow 
a classical best-first strategy to ensure that the most reliable seed is picked each 
time. This choice directly drives the propagation because it indicates where the 
growing regions are. It is our global control over the surface reconstruction. 

First of all, the initial seeds (i.e. 3D stereo points) are regarded as the reliable 
3D points on the surface. Therefore, they are always selected before the generated 
seeds. The algorithm ends when there is no seed left in the list. 

Selection criterion for 3D stereo points. The criterion for the stereo points is
defined as the confidence of the orientation estimation described in Section 3.1.
This confidence is related to the local planarity of the stereo points and therefore
estimates the accuracy of the normal computation, so $\Pi$ is taken to be this
eigenvalue-based confidence. The
visibility of a stereo point is given by the cameras used to reconstruct it.

Selection criterion for generated seeds. For a generated seed, we use the ZNCC
correlation score Z computed from its two most front-facing cameras, since a strong match
gives a high confidence. This strategy ensures that the surface grows from the
part which is more likely to be precise and robust. Thus: $\Pi = Z$. Correct
visibility is computed thanks to this propagation criterion. If the criterion is computed
from occluded cameras, the local textures in the two images do not match and
the ZNCC value is low. Therefore a seed without occlusion is processed before
a seed with occlusion. The occluded parts “wait” until other parts are
reconstructed. The visibility of the processed seed is classically determined by the
current propagated surface using a ray-tracing technique.

3.4 Generation of an Optimal Patch from a Given Seed 

Given a seed (p, n) with its visibility, a patch is grown in its neighborhood to
extend the existing surface. Since it is a local representation of the surface, it
is parameterized by a height field z = h(x, y) relative to the tangent plane
of the surface. Therefore, for the local coordinates (x, y, z), the x and y axes
are arbitrarily chosen orthogonal to n, and the z axis is parallel to n with the
coordinate origin at p. A surface patch is defined by $\{(x, y, h(x, y)) \mid (x, y) \in \mathcal{D}\}$
that is minimal for functional (4). This patch is also constrained to pass through
the stereo points and the previously computed surface.

The graph-cut technique described in [22] is then applied. This technique 
reaches an exact minimum of the functional (4) up to any arbitrary discretiza- 
tion. This involves a graph whose cuts represent the set of all possible surfaces 
z = h(x,y). Thus the classical max-flow problem [7] is equivalent to an ex- 
haustive search leading to a minimal solution of (4). Moreover, it allows us to 
constrain h(x, y) to a specified z-range. 

To use this technique, the local domain is discretized as a regular rectangular
grid $\{X_1, \ldots, X_{n_x}\} \times \{Y_1, \ldots, Y_{n_y}\} \times \{Z_1, \ldots, Z_{n_z}\}$ separated by $\Delta x$, $\Delta y$, $\Delta z$.
The functional (4) is discretized into the following form:

\[ \sum_i \sum_j \Big( c(X_i, Y_j, h(X_i, Y_j)) \, \Delta x \, \Delta y + \alpha \, |h(X_{i+1}, Y_j) - h(X_i, Y_j)| \, \Delta y + \alpha \, |h(X_i, Y_{j+1}) - h(X_i, Y_j)| \, \Delta x \Big) \tag{5} \]

with the constraint $h(x_i, y_i) = z_i$.
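
To make the discretized functional concrete, the sketch below evaluates it for a candidate height field on a small grid. It only evaluates the energy; it does not perform the graph-cut minimization of [22]. The array layout (`cost[i, j, k]` for the cost at grid node $(X_i, Y_j, Z_k)$, `h_idx[i, j]` for the chosen height index) and the function name are assumptions made for this illustration, and the smoothing weight `alpha` is assumed to multiply both difference terms as in the continuous smoothing term.

```python
import numpy as np

def discrete_energy(cost, h_idx, alpha, dx, dy, dz):
    """Energy of a height field on a regular grid (data term + L1 smoothing).

    cost:  array of shape (nx, ny, nz), cost[i, j, k] = c(X_i, Y_j, Z_k).
    h_idx: integer array of shape (nx, ny), index k of the height at (i, j).
    """
    nx, ny, _ = cost.shape
    ii, jj = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")

    # Data term: cost sampled on the surface, weighted by the cell area.
    data = cost[ii, jj, h_idx].sum() * dx * dy

    # Smoothing term: absolute height differences between grid neighbors.
    h = h_idx * dz
    smooth_x = np.abs(np.diff(h, axis=0)).sum() * dy
    smooth_y = np.abs(np.diff(h, axis=1)).sum() * dx
    return float(data + alpha * (smooth_x + smooth_y))
```

A flat height field pays only the data term, while every height step between neighbors adds a penalty proportional to `alpha`, which is what discourages noisy surfaces.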

For any $(x, y) \in \mathcal{D}$ where the surface has been already computed (including
the 3D stereo points), the h value is restricted to the single corresponding z value.
Then for each remaining (x, y, z) position, a ZNCC value is computed using the
two most fronto-parallel visible cameras to limit the perspective distortion. These
ZNCC values are transformed into $c(\cdot)$, resulting in a dense 3D sampling.

Based on the above discrete form (5), a graph is embedded into a 3D grid 
superimposed on the voxels with correspondence shown in Figure 2. Finally, 
this graph-flow problem is solved and the resulting minimum cut is equivalent 
to the optimal patch. Graph design and proof of optimization are in [22]. The 
confidence of the patch is given by the quantity of the maximum flow F. A
smaller value gives a higher confidence, while a large enough value indicates
that this patch should be discarded.

Even if the patch is trustworthy, the borders of the patch might not be 
suitable for the final surface. The border points have a truncated neighborhood 
and potentially ignore some neighboring data. To avoid this caveat, only the 
center part of the optimal solution is kept (see Figure 3). The discarded border 
only provides a complete neighborhood to the center part. Finally, this center 
part is used to extend the existing surface. 

3.5 Creation of the New Seeds to Add in the List 

We have grown a surface patch for the selected seed in the previous section. 
To continue the propagation, new seeds are created from this patch. These new 
seeds will be selected later to drive further propagation. The location of the new 
seeds is driven by several aspects. 

Patch quality. First of all, the quantity of the maximum flow F indicates the
confidence of the optimal patch. With low enough confidence, the surface 
patch is discarded and no seed is created. 

Match quality. A point with a high ZNCC value Z is more likely to provide a 
robust starting point for further propagation. 

Surface regularity. A singular point does not represent accurate properties
of the patch. Using the principal curvatures $\kappa_1$ and $\kappa_2$, points with a high
curvature measure K are therefore to be avoided.

Propagation efficiency. To ensure a faster propagation, distant points are 
preferred. This relies on the distance D between the patch center and the 
potential new seeds. 

A value $\Lambda$ is computed for each potential location of a new seed to represent
its appropriateness relative to these objectives:

\[ \Lambda = \frac{Z^{\gamma(Z)} \cdot D^{\gamma(D)}}{F^{\gamma(F)} \cdot K^{\gamma(K)}} \tag{6} \]

where the $\gamma(\cdot)$ are non-negative weights to balance the different criteria. In our
algorithm, we use $\gamma(Z) = \gamma(D) = \gamma(F) = \gamma(K) = 1$, while further study is
needed to evaluate the importance of each criterion.

The number of new seeds created is inspired by the triangle mesh
configuration. From the Euler property, the average number of neighbors of a vertex is
6 and the average angular distance between two neighbors is $2\pi/6 = \pi/3$. Thus, the
directions of the new seeds in relation to the patch center are selected so that the
angular distance between two neighboring seeds stays within an interval around this value. In each direction,
the location p' with the highest appropriateness value is selected and the normal n' at p' is attached
to form a new seed.



4 Implementation and Experiments 

4.1 Acquisition of 3D Stereo Data 

3D stereo data can be obtained using different approaches. A traditional stereo 
rig [21,24] could deliver quite dense points, but it only gives a small part of 
the object. Automatically merging partial stereo data into a complete model
is not easy, unless multiple stereo rigs are calibrated off-line. We choose
a more general reconstruction method from an uncalibrated sequence [6,9,31]. 
Our quasi-dense implementation [17] is similar to the standard uncalibrated 
approaches with the main difference that we use a much denser set of points 
re-sampled from the disparity map instead of points of interest. To model a 
complete object, we usually make a full turn around the object by capturing 
about 30 images to compute the geometry of the sequence. 



4.2 Efficiency Consideration

To select a seed to propagate, we use the selection criterion $\Pi$ to extend the surface
from its most robust regions (Section 3.3). However, this may cause inefficiency,
because the neighboring surface may have already been created from other seeds.

Therefore, an area criterion A is added and the criterion is redefined as $\Pi \leftarrow A \cdot \Pi$. The definition of
A represents the efficiency of the patch generation. It is based on the total patch
area $a_P$ and the covered area $a_c$: $A = 1 - a_c / a_P$. Since this value evolves during the
process, it is stored with the seed and updated in a “lazy” way: when the best
seed is selected by $\Pi$, the corresponding A is updated according to the current
surface. If the value changes to a non-zero value (it always decreases because
the surface is growing), $\Pi$ is updated and the seed is put back in the list. If
it changes to zero, the seed is discarded because it would provide no surface
extension. This case mainly occurs for 3D stereo points, which are redundant.
Otherwise (i.e. A does not change), the seed is selected for further processing.
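
The lazy update can be sketched with a priority queue whose entries carry the area value they were ranked with. This is an illustrative sketch under assumptions: `current_area` is a hypothetical callback returning the up-to-date uncovered fraction of the patch a seed would generate (1 minus covered area over total patch area), and the entry layout is chosen for this example only.

```python
import heapq
import itertools

_tie = itertools.count()  # tie-breaker so seeds are never compared directly

def push_seed(heap, base_score, area, seed):
    """Rank a seed by area * base_score and remember the area used."""
    heapq.heappush(heap, (-(area * base_score), next(_tie), base_score, area, seed))

def pop_best_seed(heap, current_area):
    """Return the best seed whose stored area value is still up to date."""
    while heap:
        _, _, base_score, stored_area, seed = heapq.heappop(heap)
        area = current_area(seed)
        if area <= 0.0:
            continue  # patch fully covered: the seed gives no surface extension
        if area < stored_area:
            # Value decreased: re-rank the seed and put it back in the list.
            push_seed(heap, base_score, area, seed)
            continue
        return seed   # value unchanged: process this seed
    return None
```

The benefit of this design is that the area value is recomputed only for seeds that reach the top of the queue, instead of for every seed after every surface update.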



4.3 Representation of the Growing Surface with a Distance Field 

We use a distance field to register the growing surface. The surface is
represented by the signed distance function d in its narrow band, which is the Euclidean
distance with the sign indicating either inside or outside. When a surface patch
is generated, the distance field is updated in its narrow band. This results in the
merger of this patch with its neighboring ones. Finally, the zero set Z(d) is our
estimation for S, which is extracted by the marching cubes method [19].



4.4 Experimental Results 

Three typical modelling examples are shown in Figure 4. We usually acquire
about 30 images with a hand-held camera around the objects. The first “toy” 
example is used to examine the correctness and robustness of our algorithm. The 
“toy” is difficult since fur is traditionally hard for surface reconstruction. We have 
made a tradeoff between the local features and the orientation robustness based 
on the local coherence. As we can see from the reconstructed shape, our algorithm 
handles occlusion correctly. Two “face” examples illustrate the accuracy of our 
algorithm. One face has more texture, while the other has less. The details around 
the eyes, the noses and the ears can be clearly seen thanks to the exploitation 
of the feature points. The shape is also satisfactory in the other regions, such 
as the foreheads and the cheeks because of the new propagation technique. It is 
important to stress that the final shape optimally interpolates the given set of
3D stereo points by 2D image information defined in the functional (4). 

The results in the above two face examples are of almost the same quality. We
may notice that the 3D feature points are good surface points, but their quan- 
tity is not an important factor for reconstruction. Since the derived 3D points 
are highly redundant, actually many of the points provide limited information. 
Future study is needed for the quantity and the positions of the initial seeds. 

Fig. 4. Each row shows the results for one example: (a) One of the input images; (b)
Reconstructed 3D points; (c,d) Surface shape at two different viewpoints; (e) Surface 
shape with color. 




Fig. 5. Comparative results: (a) Space carving only gives a rough estimation; (b) Level 
set method yields an over-smoothed result; (c) The propagation gives the most detailed 
surface geometry. 

Figure 5 shows a comparative study among the space carving method [15],
the level set method [18] and our propagation approach. The space carving 
method only uses image information, and misses a lot of details on the nose, the 
hair and the ears due to its photo-consistency criterion, which often yields an 
over-estimation on the intensity-homogeneous regions. The level set method and 
our approach are both based on stereo points and images. However the level set 
method tends to over-smooth the surface by losing many geometric details. 

5 Conclusions 

We have proposed a novel approach for surface reconstruction by propagating 
3D stereo data in multiple 2D images. It is based on the fact that the feature 
points can be accurately and robustly determined. These points give important 
information for surface reconstruction. On the other hand, they are not sufficient 
to generate the whole surface representation. The major motivation of this study
is to overcome the insufficiency of 3D stereo data by using the original 2D images.

Our strategy to achieve this goal is to perform a propagation in a 3D space 
starting from reliable feature points. We have introduced a new functional (4) 
integrating both stereo data points and image information. The surface grows 
patch by patch from the most reliable regions by the local graph cut technique. 
Finally, the shape optimally interpolates the given set of 3D stereo points by 
2D image information defined in the functional (4). This approach has been
extensively tested on real sequences and very convincing results have been shown.



Abstract. We analyze visibility from static sensors in a dynamic scene with moving
obstacles (people). Such analysis is considered in a probabilistic sense in the
context of multiple sensors, so that visibility from even one sensor might be suffi- 
cient. Additionally, we analyze worst-case scenarios for high-security areas where 
targets are non-cooperative. Such visibility analysis provides important perfor- 
mance characterization of multi-camera systems. Furthermore, maximization of 
visibility in a given region of interest yields the optimum number and placement 
of cameras in the scene. Our analysis has applications in surveillance - manual 
or automated - and can be utilized for sensor planning in places like museums, 
shopping malls, subway stations and parking lots. We present several example 
scenes - simulated and real - for which interesting camera configurations were 
obtained using the formal analysis developed in the paper. 



1 Introduction 

We present a method for sensor planning that is able to determine the required number and 
placement of static cameras (sensors) in a dynamic scene. Such analysis has previously 
been presented for the case of static scenes where the constraints and obstacles are static. 
However, in many applications, apart from these static constraints, there exists occlusion 
due to dynamic objects (people) in the scene. In this paper, we incorporate these dynamic 
visibility constraints into the sensor planning task. These constraints are analyzed in a 
probabilistic sense in the context of multiple sensors. Furthermore, we develop tools 
for analyzing worst-case visibility scenarios that are more meaningful for high-security 
areas where targets are non-cooperative. 

Our analysis is useful for both manned and automated vision systems. In manned 
systems where security personnel are looking at the video stream, it is essential that 
the personnel have visibility of the people in the scene. In automated systems, where 
advanced algorithms are used to detect and track multiple people from multiple cameras, 
our analysis can be used to place the cameras in an optimum configuration. 

Automated multi-camera vision systems have been developed using a wide range of
camera arrangements. For better stereo matching, some systems [1] use closely-spaced
cameras. Others [2,3] adopt the opposite arrangement of widely separated cameras for
maximum visibility. Others [4] use a hybrid approach. Still others [5,6,7] use multiple
cameras for the main purpose of increasing the field of view. In all these systems, there is
a need for analyzing the camera arrangement for optimum placement. In many cases, our
method can be utilized without any alteration. In systems that have additional algorithmic 
requirements (e.g. stereo matching), further constraints - hard or soft - can be specified 
so that the optimum camera configuration satisfies (hard) and is optimum (soft) w.r.t. 
these additional constraints. 

In addition to providing the optimum configuration, our analysis can provide a gold 
standard for evaluating the performance of these systems under the chosen configu- 
ration. This is because our analysis provides the theoretical limit of detectability. No 
algorithm can surpass such a limit since the data is missing from the images. Thus, one 
can determine how much of the error in a system is due to missing data, and how
much of it is due to the chosen algorithm. 

Sensor planning has been researched quite extensively, especially in the robotics 
community, and there are several different variations depending on the application. One 
set of methods uses an active camera mounted on a robot. The objective then is to move the
camera to the best location in the next view based on the information captured up to now.
These methods are called next view planning [8,9,10]. Another set of methods obtains a
model (either 2D or 3D) of a scene by optimum movement of the camera [11,12]. Such
model acquisition imposes certain constraints on the camera positions, and satisfaction 
of these constraints guarantees optimum and stable acquisition. 

Methods that are directly related to ours are those that determine the location of static 
cameras so as to obtain the best views of a scene. This problem was originally considered 
in the computational geometry literature as the art-gallery problem [13]. The solutions 
in this domain utilize simple 2D or 3D scene models and simple assumptions on the 
cameras and occlusion in order to develop theoretical results and efficient algorithms 
to determine good sensor configurations (although the NP-hard nature of the problem 
typically necessitates an approximate solution). Several researchers [14,15,16,17] have 
studied and incorporated more complex constraints based on several factors not limited 
to (1) resolution, (2) focus, (3) field of view, (4) visibility, (5) view angle, and (6) prohib- 
ited regions. In addition to these “static” constraints, there exist additional “visibility” 
constraints imposed by the presence of dynamic obstacles. Such constraints have not 
been analyzed earlier and their incorporation into the sensor planning task constitutes 
the novel aspect of our work. 

The paper is organized as follows. Section 2 develops the theoretical framework for 
estimating the probability of visibility of an object at a given location in a scene for a 
certain configuration of sensors. Section 3 introduces some deterministic tools to analyze 
worst-case visibility scenarios. Section 4 describes the development of a cost function 
and its minimization in order to perform sensor planning in complex environments. 
Section 5 concludes the paper with some simulated and real experiments. 



2 Probabilistic Visibility Analysis 

In this section, we analyze probabilistically the visibility constraints in a multi-camera 
setting. Specifically, we develop tools for evaluating the probability of visibility of an 
object from at least one sensor. Since this probability varies across space, this probability 
is recovered for each possible object position. 

Fig. 1. (a) Scene geometry used for stochastic reasoning. (b) The distance up to which an object
can occlude another object is proportional to its distance from the sensor.



2.1 Visibility from at Least One Sensor 

Assume that we have a region $\mathcal{R}$ of area A observed by n sensors [Fig. 1 (a)]. Let $E_i$ be the
event that a target object O at location L is visible from sensor i. The probability that O is
visible from at least one sensor can be expressed mathematically as the probability $P(\bigcup_{i=1}^{n} E_i)$ of the union
of these events, and it can be expanded using the inclusion-exclusion principle as:

\[ P\Big(\bigcup_{i} E_i\Big) = \sum_{\forall i} P(E_i) - \sum_{i<j} P(E_i \cap E_j) + \cdots + (-1)^{n+1} P\Big(\bigcap_{i} E_i\Big) \tag{1} \]

The motivation for this expansion is that it is easier to compute the terms on the RHS 
(right hand side) compared to the one on the LHS. 
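
The inclusion-exclusion expansion can be sketched as follows. This is a minimal illustration, assuming a hypothetical callback `prob_all_visible(subset)` that returns the probability that the object is visible from every sensor in `subset`; how that probability is obtained is the subject of the rest of this section.

```python
from itertools import combinations

def prob_visible_from_any(n, prob_all_visible):
    """P(union of E_i) from the probabilities of all joint events.

    prob_all_visible(subset) must return P(intersection of E_i, i in subset)
    for a tuple of sensor indices drawn from range(n).
    """
    total = 0.0
    for m in range(1, n + 1):
        sign = 1.0 if m % 2 == 1 else -1.0  # +, -, +, ... by subset size
        for subset in combinations(range(n), m):
            total += sign * prob_all_visible(subset)
    return total
```

The sum ranges over all non-empty sensor subsets, so its cost grows exponentially with the number of sensors; for the small camera counts considered in sensor planning this is usually acceptable.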

In order to facilitate the introduction of the approach to be followed in computing
$P(\bigcup_{i=1}^{n} E_i)$, we consider the specific case of objects moving on a ground plane. The
objects are also assumed to have the same horizontal profile at each height. Examples of
such objects include cylinders, cubes, cuboids, and square prisms, and can adequately
describe the objects of interest in many applications such as people detection and
tracking. Let the area of their projection onto the ground plane be $A_{ob}$. Furthermore, we
assume that the sensors are placed at some known heights $H_i$ from this plane. Also, we
define visibility to mean that the center line of the object (corresponding to the centroid
in the horizontal profile) is visible for at least some length h from the top of the object
(in people tracking, this might correspond to viewing the face).

A useful quantity can be defined for the objects by considering the projection of the
object in a particular direction. We then define r as the average, over different directions,
of the maximum distance from the centroid to the projected object points. For example, for
cylinders, r is the radius; for a square prism with side 2s, $r = \frac{4}{\pi} \int_0^{\pi/4} s \cos\theta \, d\theta = 2\sqrt{2}\, s / \pi$.
The quantity r will be useful in calculating the average occluding region of an
object. Furthermore, it can easily be shown that the distance $d_i$ up to which an object
can occlude another object is proportional to its distance $D_i$ from sensor i [Fig. 1 (b)].
Mathematically,

\[ d_i = (D_i - d_i) \, \mu_i = D_i \frac{\mu_i}{\mu_i + 1}, \quad \text{where } \mu_i = \frac{h}{H_i}. \tag{2} \]



A. Mittal and L.S. Davis 



Fixed Number of Objects. In order to develop the analysis, we start with the case
of a fixed number k of objects in the scene under the assumption that they are located
randomly and uniformly in region $\mathcal{R}$. This will be extended to the more general case of
object densities in subsequent sections.

Under this assumption, we first estimate $P(E_i)$, which refers to the probability that
none of the k objects is present in the region of occlusion $\mathcal{R}_i^o$ for camera i. Assuming
that all object orientations are equally likely¹, one may approximate the area of this
region of occlusion as $A_i^o \approx d_i (2r)$. Then, the probability for a single object to not be
present in this region of occlusion is $\left(1 - \frac{A_i^o}{A}\right)$. Since there are k objects in the scene
located independently of each other, the probability that none of them is present in the
region of occlusion is $\left(1 - \frac{A_i^o}{A}\right)^k$. Thus:

\[ P(E_i) = \left(1 - \frac{A_i^o}{A}\right)^k \tag{3} \]



In order to provide this formulation, we have neglected the fact that two objects
cannot overlap each other. In order to incorporate this condition, we observe that the
(j + 1)-th object has a possible area of only $A - jA_{ob}$ available to it². Thus, Equation
3 can be refined as

\[ P(E_i) = \prod_{j=0}^{k-1} \left(1 - \frac{A_i^o}{A - jA_{ob}}\right) \tag{4} \]



This analysis can be generalized to other terms in Equation 1. The probability that the
object is visible from all of the sensors in a specified set $(i_1, i_2, \ldots, i_m)$ can be determined
as:

\[ P\Big(\bigcap_{i \in (i_1, \ldots, i_m)} E_i\Big) = \prod_{j=0}^{k-1} \left(1 - \frac{A^o_{(i_1, \ldots, i_m)}}{A - jA_{ob}}\right) \tag{5} \]

where $A^o_{(i_1, \ldots, i_m)}$ is the area of the combined region of occlusion $\mathcal{R}^o_{(i_1, \ldots, i_m)}$ for the sensor
set $(i_1, \ldots, i_m)$, formed by the “geometric” union of the regions of occlusion $\mathcal{R}^o_{i_p}$ for the
sensors in this set, i.e. $\mathcal{R}^o_{(i_1, \ldots, i_m)} = \bigcup_{p=1}^{m} \mathcal{R}^o_{i_p}$.
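
The fixed-count formulas can be sketched numerically as follows. This is an illustrative sketch that assumes the occlusion distance is $d = D\mu/(\mu + 1)$ with $\mu = h/H$, the occlusion area is approximated by $d \cdot 2r$, and the visibility probability is the product of $(1 - A^o / (A - jA_{ob}))$ over the k objects; the function names and argument order are chosen for this example only.

```python
def occlusion_distance(D, h, H):
    """Distance d up to which an object can occlude the target (d = (D - d) * mu)."""
    mu = h / H
    return D * mu / (mu + 1.0)

def occlusion_area(D, h, H, r):
    """Approximate area of the region of occlusion, A_occ = d * (2r)."""
    return occlusion_distance(D, h, H) * 2.0 * r

def prob_visible(A_occ, A, k, A_ob=0.0):
    """Probability that none of k objects lies in the region of occlusion.

    With A_ob = 0 this is (1 - A_occ / A) ** k; with A_ob > 0 the (j+1)-th
    object only has the area A - j * A_ob available to it.
    """
    p = 1.0
    for j in range(k):
        p *= 1.0 - A_occ / (A - j * A_ob)
    return p
```

The same `prob_visible` function serves for a set of sensors if `A_occ` is taken to be the area of the union of their regions of occlusion.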



¹ It is possible to perform the analysis by integration over different object orientations. However,
for ease of understanding, we will use this approximation.

² The prohibited area is in fact larger. For example, for cylindrical objects, another object cannot
be placed anywhere within a circle of radius 2r (rather than r) without intersecting the object.
For simplicity and ease of understanding, we redefine $A_{ob}$ as the area “covered” by the object.
This is the area of the prohibited region and may be approximated as four times the actual area
of the object.

Uniform Object Density. A fixed assumption on the number of objects in a region is
clearly inadequate. A more realistic assumption is that the objects have a certain density
of occupancy. First, we consider the case of uniform object density in the region. This
will be extended to the more general case of non-uniform object density in the next
section. The uniform density case can be treated as a generalization of the “k objects”
case introduced in the previous section. To this end, we increase k and the area A
proportionately such that

\[ k = \lambda A \tag{6} \]



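where a constant object density $\lambda$ is assumed. Equation 5 can then be written as

\[ P\Big(\bigcap_{i \in (i_1, \ldots, i_m)} E_i\Big) = \lim_{k \to \infty} \prod_{j=0}^{k-1} \left(1 - \frac{A^o_{(i_1, \ldots, i_m)}}{k/\lambda - jA_{ob}}\right) \tag{7} \]

We define:

\[ a = \frac{1}{\lambda A^o_{(i_1, \ldots, i_m)}}, \qquad b = \frac{A_{ob}}{A^o_{(i_1, \ldots, i_m)}} \tag{8} \]

Here, a captures the effect of the presence of objects and b is a correction to such effect
due to the finite object size. Then, we obtain:

\[ P\Big(\bigcap_{i \in (i_1, \ldots, i_m)} E_i\Big) = \lim_{k \to \infty} \prod_{j=0}^{k-1} \left(1 - \frac{1}{ka - jb}\right) \tag{9} \]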



Combining terms for j and k — j, we get 

1 



1 - 



1 - 



ka — jb J \ ka — (k — j)b y 
k 2 a 2 — k 2 ab — 2 ka + j(k — j)b 2 + bk+ 1 
k 2 a 2 — k 2 ab + j{k — j)b 2 



Assuming a^> b (i.e. the object density A is much smaller than 1 /A a t, the object density 
if the area is fully packed), we can neglect terms involving b 2 . Then, the above term can 
be written as 



1 f 2a- b 



k \a 2 — ab 

There are k/2 such terms in Equation 9. Therefore, 



P ( n « » A™ 0 - 1 (jti) 



Using the identity lim x ^ oo(l + = e, we get 

P{ f| 



k/2 



(10) 
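The quality of the large-k approximation can be checked numerically. The following is a minimal sketch, assuming the product form $\prod_{j=0}^{k-1}(1 - 1/(ka - jb))$ for Equation 9 and the closed form $e^{-(2a-b)/(2a(a-b))}$ for Equation 10; the function names and the values of a and b are hypothetical and chosen only for illustration.

```python
import math


def exact_visibility(k: int, a: float, b: float) -> float:
    """Product of Equation 9: prod_{j=0}^{k-1} (1 - 1/(k*a - j*b))."""
    prob = 1.0
    for j in range(k):
        prob *= 1.0 - 1.0 / (k * a - j * b)
    return prob


def approx_visibility(a: float, b: float) -> float:
    """Closed form of Equation 10: exp(-(2a - b) / (2a(a - b)))."""
    return math.exp(-(2.0 * a - b) / (2.0 * a * (a - b)))


if __name__ == "__main__":
    a, b = 5.0, 0.5  # hypothetical values chosen so that a >> b
    for k in (10, 100, 1000):
        # The exact product should approach the closed form as k grows.
        print(k, exact_visibility(k, a, b), approx_visibility(a, b))
```

For $b = 0$ the closed form reduces to $e^{-1/a}$, which is the familiar limit of $(1 - 1/(ka))^k$.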
Non-uniform Object Density. In general, the object density ($\lambda$) is a function of the
location. For example, the object density near a door might be higher. Moreover, the
presence of an object at a location influences the object density nearby since objects
tend to appear in groups. We can integrate both of these influences on the object density
with the help of a conditional density function $\lambda(\mathbf{x}_c|\mathbf{x}_0)$ that might be available to
us. This density function gives the density at location $\mathbf{x}_c$ given that visibility is being
calculated at location $\mathbf{x}_0$. Thus, this function is able to capture the effect that the presence
of the object at location $\mathbf{x}_0$ has on the density nearby$^3$.

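In order to develop the formulation for the case of non-uniform density, we note that
the $(j + 1)$-th object has a region available to it that is $\mathcal{R}$ minus the region occupied
by the $j$ previous objects. This object is located in this “available” region according to
the density function $\lambda(\cdot)$. The probability for this object to be present in the region of
occlusion $\mathcal{R}^{o}_{(i_1,\ldots,i_m)}$ can then be calculated as the ratio of the average number of people
present in the region of occlusion to the average number of people in the available region.
Thus, one can write:

$$P\Big(\bigcap_{i \in (i_1,\ldots,i_m)} \mathcal{E}_i\Big) = \prod_{j=0}^{k-1}\left(1 - \frac{\int_{\mathcal{R}^{o}_{(i_1,\ldots,i_m)}} \lambda(\mathbf{x}_c|\mathbf{x}_0)\,d\mathbf{x}_c}{\int_{\mathcal{R} - \mathcal{R}^{ob}_j} \lambda(\mathbf{x}_c|\mathbf{x}_0)\,d\mathbf{x}_c}\right) \tag{11}$$

where $\mathcal{R}^{ob}_j$ is the region occupied by the previous $j$ objects. Since the previous $j$ objects
are located randomly in $\mathcal{R}$, one can simplify:

$$\int_{\mathcal{R} - \mathcal{R}^{ob}_j} \lambda(\mathbf{x}_c|\mathbf{x}_0)\,d\mathbf{x}_c = \lambda_{avg}\,(A - jA_{ob})$$

where $\lambda_{avg}$ is the average object density in the region. Using this simplification in
Equation 11 and noting that $\lambda_{avg} A = k$, we obtain:

$$P\Big(\bigcap_{i \in (i_1,\ldots,i_m)} \mathcal{E}_i\Big) = \prod_{j=0}^{k-1}\left(1 - \frac{\int_{\mathcal{R}^{o}_{(i_1,\ldots,i_m)}} \lambda(\mathbf{x}_c|\mathbf{x}_0)\,d\mathbf{x}_c}{k - j \cdot \lambda_{avg} \cdot A_{ob}}\right) \tag{12}$$

Defining:

$$a = \frac{1}{\int_{\mathcal{R}^{o}_{(i_1,\ldots,i_m)}} \lambda(\mathbf{x}_c|\mathbf{x}_0)\,d\mathbf{x}_c}, \qquad b = \frac{A_{ob} \cdot \lambda_{avg}}{\int_{\mathcal{R}^{o}_{(i_1,\ldots,i_m)}} \lambda(\mathbf{x}_c|\mathbf{x}_0)\,d\mathbf{x}_c}, \tag{13}$$

Equation 12 may again be put in the form of Equation 9. As before, this may be simplified
to obtain the expression in Equation 10.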



3 Such formulation only captures the first-order effect of the presence of an object. While higher
order effects due to the presence of multiple objects can be considered, they are likely to be
small.

2.2 Visibility from Multiple Sensors

In many applications, it is desirable to view an object from more than one sensor. Stereo
reconstruction/depth recovery is an example where the requirement of visibility from at
least two sensors is to be satisfied. In order to evaluate the probability of visibility from
at least two sensors, one can evaluate:

$$P\Big(\bigcup_{(i,j):\, i<j} (\mathcal{E}_i \cap \mathcal{E}_j)\Big) \tag{14}$$

This term can be expanded exactly like Equation 1, treating each term $(\mathcal{E}_i \cap \mathcal{E}_j)$ as a
single entity. All the terms on the RHS will then have only intersections in them, which
are easy to compute using the formulation developed in the previous sections.
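The expansion into intersections can be sketched in code. The sketch below assumes that Equation 1 is the standard inclusion-exclusion expansion of the probability of a union, and that a callable `joint` (a hypothetical name) returns the probability that the object is visible from all sensors in a given set. The number of terms grows exponentially with the number of sensor pairs, so this is meant only for a handful of sensors.

```python
from itertools import combinations
from math import prod
from typing import Callable, FrozenSet


def prob_at_least_two(n_sensors: int,
                      joint: Callable[[FrozenSet[int]], float]) -> float:
    """P(union over pairs i<j of (E_i and E_j)) by inclusion-exclusion.

    joint(S) must return P(intersection of E_i for i in S).
    """
    pairs = list(combinations(range(n_sensors), 2))
    total = 0.0
    for r in range(1, len(pairs) + 1):
        sign = 1.0 if r % 2 == 1 else -1.0
        for subset in combinations(pairs, r):
            # Intersecting several pair-events is the same as intersecting
            # the E_i of every sensor that appears in any of the pairs.
            sensors = frozenset(i for pair in subset for i in pair)
            total += sign * joint(sensors)
    return total


if __name__ == "__main__":
    p = [0.9, 0.8, 0.7]  # hypothetical, independent per-sensor visibilities
    independent = lambda s: prod(p[i] for i in s)
    print(prob_at_least_two(3, independent))
```

The independence used in the usage example is only a convenient test case; in the formulation above, `joint` would be evaluated with Equation 10.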

2.3 Additional Constraints 

Other “static” constraints also affect the view of a particular camera. Therefore, the visi- 
bility probability needs to be calculated after incorporating these additional constraints. 
This is easily achieved in our scheme since the visibility constraints are analyzed at 
individual locations and additional constraints can also be verified at these locations. 
The constraints that have been incorporated in our system include: 

1. FIELD OF VIEW: Cameras have a limited field of view. At each location, it can be 
verified whether that location is within the field of view of a particular camera. 

2. OBSTACLES: Fixed high obstacles like pillars cause occlusions in certain areas. 
From a given location, it needs to be determined whether any obstacle blocks the 
view of a particular camera. 

3. PROHIBITED AREAS: There might also exist prohibited areas where people are 
not able to walk. An example of such an area is a desk. These areas have a positive 
effect on the visibility in their vicinity since it is not possible for obstructing objects 
to be present within such regions. 

4. RESOLUTION: The resolution of an object in an image reduces as the object moves
further away from the camera. Therefore, meaningful observations are possible only 
up to a certain distance from the camera. It can easily be verified whether the location 
is within a certain “resolution distance” from the camera. 

5. ALGORITHMIC CONSTRAINTS: There are several algorithmic constraints that 
may exist. For example, stereo matching across two (or more) cameras imposes a 
constraint on the maximum distortion of the view that can occur from one camera 
to the other. This constraint can be expressed in terms of the angular separation 
between the camera centers from the point of view of the object. It can easily be
verified whether this constraint is satisfied at a particular location.

6. VIEWING ANGLE: An additional constraint exists for the maximum angle $\alpha_{max}$ at
which the observation of an object is meaningful. Such observation can be the basis
for performing some other tasks like object recognition. This constraint translates
into a constraint on the minimum distance from the sensor that an object must be.
This minimum distance guarantees the angle of observation to be smaller than $\alpha_{max}$.
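Three of the constraints listed above (field of view, resolution and viewing angle) reduce to simple geometric tests at each location. The following is a minimal sketch under stated assumptions: a flat 2D ground plane, a camera described by a hypothetical `Camera` record, and the angle of observation taken to be the angle between the line of sight and the horizontal at the minimum visibility height h. None of these names come from the system described here.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    x: float          # position on the ground plane
    y: float
    heading: float    # viewing direction in the ground plane, radians
    fov: float        # full horizontal field of view, radians
    height: float     # mounting height above the ground
    max_range: float  # assumed "resolution distance"


def within_fov(cam: Camera, px: float, py: float) -> bool:
    bearing = math.atan2(py - cam.y, px - cam.x)
    # Wrap the angular difference into [-pi, pi).
    diff = (bearing - cam.heading + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff) <= cam.fov / 2.0


def within_resolution(cam: Camera, px: float, py: float) -> bool:
    return math.hypot(px - cam.x, py - cam.y) <= cam.max_range


def min_distance_for_angle(cam: Camera, h: float, alpha_max: float) -> float:
    """Ground distance at which the line of sight to height h makes
    an angle alpha_max with the horizontal (assumed interpretation)."""
    return (cam.height - h) / math.tan(alpha_max)


def satisfies_static_constraints(cam: Camera, px: float, py: float,
                                 h: float, alpha_max: float) -> bool:
    dist = math.hypot(px - cam.x, py - cam.y)
    return (within_fov(cam, px, py)
            and within_resolution(cam, px, py)
            and dist >= min_distance_for_angle(cam, h, alpha_max))
```

Obstacles and prohibited areas are omitted because they depend on the scene representation.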

The analysis presented so far is probabilistic and provides “average” answers. In 
high security areas, worst-case analysis might be more appropriate. Such analysis will 
be presented in the next section. 




3 Worst-Case Visibility Analysis

In this section, we present some simple results for location-specific limitations of a
given system in the worst case. This analysis provides conditions that guarantee visibility
regardless of object configuration and enables sensor placement such that these conditions
are satisfied in a given region of interest. Since the analysis is quite simple, we will only
briefly describe these results. We propose:

Theorem 1. Suppose there is an object $O$ at location $L$. If there are $k$ point objects in
the vicinity of $O$, and $n$ sensors have visibility of location $L$, then $n > k + m - 1$ is the
necessary and sufficient condition to guarantee visibility for $O$ from at least $m$ sensors.

Proof. (a) Necessary: Suppose $n \le k + m - 1$. Place $p = \min(k, n)$ objects such that
each obstructs one sensor. The number of sensors having a clear view of the object is then
equal to $n - p$, which is less than $m$ (follows easily from the condition $n \le k + m - 1$).
(b) Sufficient: Suppose $n > k + m - 1$. $O$ has $n$ lines of sight to the sensors, at most $k$ of
which are possibly obstructed by other objects. Therefore, by the extended pigeon-hole
principle, there must be at least $n - k \ge m$ sensors viewing $O$.

This result holds for point objects only. It can be extended to finite objects if certain 
assumptions are made. One can assume a flat world scenario where the objects and the 
sensors are in 2D. Also assume that we are given a point of interest in the object such 
that object visibility is defined as the visibility of this point of interest. This point can 
be defined arbitrarily. Let us also define an angle $\alpha$ as the maximum angle that any
object can subtend at the point of interest of any other object. For example, for identical
cylinders with the center as the point of interest, $\alpha = 60^\circ$. For identical square prisms,
$\alpha = 90^\circ$. Under these assumptions, the above result holds if we take $n$ to be the number
of sensors that have visibility of location $L$ such that the angular separation between
any two sensors, from the point of view of $L$, is at least $\alpha$. Also, $n$ cannot exceed
$2\pi/\alpha$ since it is not possible to place $n > 2\pi/\alpha$ sensors such that there is an angular
separation of at least $\alpha$ between them.

For a given camera configuration, one can determine the number of cameras that 
each location of interest has visibility to. This will yield the maximum number of people 
that can be present in the vicinity of the person and still guarantee visibility for him. 
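The condition of Theorem 1 and the resulting bound on the number of nearby people can be written down directly. This is a small sketch for point objects only; the function names are hypothetical, and the adversarial placement in `worst_case_clear_views` mirrors part (a) of the proof.

```python
from typing import Optional


def worst_case_clear_views(n: int, k: int) -> int:
    """Clear lines of sight left when an adversary uses each of the k
    point objects to block a different one of the n sensors."""
    return n - min(k, n)


def guaranteed_visible(n: int, k: int, m: int = 1) -> bool:
    """Theorem 1: visibility from at least m sensors is guaranteed
    if and only if n > k + m - 1."""
    return n > k + m - 1


def max_tolerated_occluders(n: int, m: int = 1) -> Optional[int]:
    """Largest k for which the guarantee holds, or None if n < m."""
    return n - m if n >= m else None


if __name__ == "__main__":
    # Four cameras see a location; stereo needs m = 2 of them.
    print(max_tolerated_occluders(4, m=2))
```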

4 Sensor Planning 

The visibility analysis presented in Section 2 yields a function $p_s(\mathbf{x})$ that gives the
probability that an object located at location $\mathbf{x}$ is visible from at least one of the sensors
that have the parameter vector $s$. Such parameter vector may include, for instance, the
location, viewing direction and zoom of each camera. Given such a function, one can 
define a suitable cost function in order to evaluate a given set of sensor parameters. 
Such sensor parameters may be further constrained due to other factors. For instance, 
there typically exists a physical limitation on the positioning of the cameras (walls, 
ceilings etc.). The sensor planning problem can then be formulated as a problem of 
constrained optimization of the cost function. Such optimization will yield the optimum 
sensor parameters according to the specified cost function. 
4.1 The Cost Function

Several cost functions may be considered. Based on deterministic visibility analysis, one
can consider a simple cost function that sums, over the region of interest $\mathcal{R}_i$, the number
$N(\mathbf{x})$ of cameras that a location $\mathbf{x}$ has visibility to:

$$C(s) = -\sum_{\mathbf{x} \in \mathcal{R}_i} N(\mathbf{x}) \tag{15}$$

Using probabilistic analysis, one may define a cost function that minimizes the maximum
occlusion probability in the region:

$$C(s) = \max_{\mathbf{x} \in \mathcal{R}_i} \big(1 - p_s(\mathbf{x})\big)$$

Another cost function, and perhaps the most reasonable one in many situations, is to
define the cost as the negative of the average number of visible people in a given region
of interest:

$$C(s) = -\int_{\mathcal{R}_i} \lambda(\mathbf{x})\, p_s(\mathbf{x})\, d\mathbf{x} \tag{16}$$

This cost function has been utilized for obtaining the results in this paper. 
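In practice the integral in Equation 16 has to be evaluated numerically. The following is a minimal sketch, assuming a rectangular region of interest sampled on a regular grid with the midpoint rule; `density` and `visibility` are hypothetical callables standing in for $\lambda(\mathbf{x})$ and $p_s(\mathbf{x})$, and the discretization is an illustration rather than the scheme used for the reported results.

```python
from typing import Callable, Tuple


def average_visibility_cost(density: Callable[[float, float], float],
                            visibility: Callable[[float, float], float],
                            x_range: Tuple[float, float],
                            y_range: Tuple[float, float],
                            step: float) -> float:
    """Midpoint-rule estimate of C(s) = -integral of lambda(x) p_s(x) dx."""
    (x0, x1), (y0, y1) = x_range, y_range
    nx = int(round((x1 - x0) / step))
    ny = int(round((y1 - y0) / step))
    total = 0.0
    for i in range(nx):
        for j in range(ny):
            cx = x0 + (i + 0.5) * step  # cell center
            cy = y0 + (j + 0.5) * step
            total += density(cx, cy) * visibility(cx, cy)
    return -total * step * step


if __name__ == "__main__":
    # 10 m x 20 m room, one person per square meter, visible half the time.
    cost = average_visibility_cost(lambda x, y: 1.0, lambda x, y: 0.5,
                                   (0.0, 10.0), (0.0, 20.0), 1.0)
    print(cost)
```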

It is also possible to integrate other constraints into the cost function. For instance,
some of the constraints in Section 2.3 may be specified as soft constraints rather than hard
constraints (e.g. resolution, viewing angle and algorithmic constraints). According
to the application, any arbitrary function of the constraints may be considered:

$$C(s) = f(c_1, \ldots, c_J, \lambda(\cdot), \mathcal{R}_i)$$

where $c_j$, $j = 1 \ldots J$ are the different constraints to be satisfied.

4.2 Minimization of the Cost Function 

The cost function defined by Equation 16 (like other suitable ones) is non-linear, and
it can be shown that it is not differentiable. Furthermore, in most non-trivial cases, it has
multiple local minima and possibly multiple global minima. Fig. 2 illustrates the cost 
function for the scene shown in Fig. 4 (a), where, for illustration purposes, only two 
of the nine parameters have been varied. Even in this two dimensional space, there are 
two global minima and several local minima. Furthermore, the gradient is zero in some 
regions. 

Due to these characteristics of the cost function, it is not possible to minimize it
using simple gradient-based methods that can only find the local minimum of a well-
behaved “convex” function. Global minimization methods that can deal with complex
cost functions are necessary [18]. Simulated Annealing and Genetic Algorithms are two
classes of algorithms that may be considered. The nature of the cost function suggests
that either of these two algorithms should provide an acceptable solution [19]. For our
experiments, we implemented a simulated annealing scheme using a highly sophisticated
simulated re-annealing software ASA developed by L. Ingber [20].
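To make the optimization step concrete, the following is a minimal sketch of a plain simulated annealing loop with geometric cooling and Gaussian proposals clipped to box bounds. It is a generic textbook variant, not the ASA package and not its re-annealing schedule; all parameter names and default values are assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def simulated_annealing(cost: Callable[[Sequence[float]], float],
                        x0: Sequence[float],
                        bounds: Sequence[Tuple[float, float]],
                        n_iter: int = 20000,
                        t0: float = 1.0,
                        cooling: float = 0.9995,
                        step: float = 0.1,
                        seed: int = 0) -> Tuple[List[float], float]:
    """Minimize cost(x) subject to box bounds; returns (best_x, best_cost)."""
    rng = random.Random(seed)
    x, fx = list(x0), cost(x0)
    best, fbest = list(x), fx
    t = t0
    for _ in range(n_iter):
        # Gaussian perturbation scaled by the width of each bound, then clipped.
        cand = [min(max(xi + rng.gauss(0.0, step * (hi - lo)), lo), hi)
                for xi, (lo, hi) in zip(x, bounds)]
        fc = cost(cand)
        # Always accept improvements; accept uphill moves with Boltzmann probability.
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = list(x), fx
        t = max(t * cooling, 1e-12)  # geometric cooling with a small floor
    return best, fbest


if __name__ == "__main__":
    # A non-differentiable toy cost with its minimum at (1.0, -0.5).
    toy = lambda x: abs(x[0] - 1.0) + abs(x[1] + 0.5)
    print(simulated_annealing(toy, [-2.5, 2.5], [(-3.0, 3.0), (-3.0, 3.0)]))
```

In the sensor planning setting, `cost` would evaluate Equation 16 for a candidate parameter vector s, and the bounds would encode the physical limits on camera placement.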
Fig. 2. The Cost Function for the scene in Fig. 4 (a) where, for illustration purposes, only the
x-coordinate and direction of the second camera have been varied.



Using this algorithm, we were able to obtain extremely good sensor configurations 
in a reasonable amount of time (from 5 minutes to a couple of hours on a Pentium IV 2.2 GHz PC,
depending on the desired accuracy of the result, the number of dimensions of the search 
space and complexity of the scene). For low dimensional spaces (< 4), where it was 
feasible to verify the results using full search, it was found that the algorithm quickly 
converged to a global minimum. For moderate dimensions of the search space (< 8), 
the algorithm was again able to obtain the optimum solution, but only after some time. 
Although the optimality of the solution could not be verified by full search, we assumed 
such solution to be optimum since running the algorithm several times from different 
starting points and different annealing parameters did not alter the final solution. For 
very high dimensional spaces (> 8), although the algorithm provided “good” solutions 
very quickly, it took several hours to converge to the best one. Some of the “optimal” 
solutions thus obtained will be illustrated in the next section. 



5 Simulations and Experiments 

We have proposed a stochastic algorithm for recovering the optimal sensor configuration 
with respect to certain visibility requirements. In order to validate the proposed method, 
we provide results of the algorithm for various scenes, synthetic and real. 

5.1 Synthetic Experiments 

In all the synthetic examples we consider next, we take a rectangular room of size
10 m × 20 m. The sensors were restricted to be mounted $H = 2.5$ m above the ground
and have a field of view of 90°. We use a uniform object density $\lambda = 1\,\mathrm{m}^{-2}$, object
height = 150 cm, object radius $r = 15$ cm, minimum visibility height $h = 50$ cm and maximum
visibility angle $\alpha_{max} = 45^\circ$. The illustrations shown are visibility maps scaled such that
[0,1] maps onto [0,255], thus creating a gray scale image. Brighter regions represent
higher visibility. Note how the visibility decreases as we move away from a camera due
to an increase in the distance of occlusion $d_i$.

Fig. 3. Illustration of the effect of scene geometry on sensor placement. Optimum configuration
when (a): obstacle size is small, (b): obstacle size is big, (c): obstacle size is such that both
configurations are equally good.

Fig. 3 illustrates the effect that an obstacle can have on camera placement. Using 
a maximum of two cameras having a field of view of 90°, the first configuration [a] 
was found to be optimum when the obstacle size was small (<60 cm). Configuration
[b] was optimum when the obstacle size was big (>60 cm). For the obstacle size shown in
configuration [c] (≈60 cm), both configurations were equally good. Note that, in both
configurations, all locations are visible from at least one camera. Therefore, current 
methods based solely on analysis of static obstacles would not be able to distinguish 
between the two. 

Fig. 4 illustrates how the camera specifications can significantly alter the optimum 
sensor configuration. Notice that the scene has both obstacles and prohibited areas. With 
three available cameras, configuration [a] was found to be optimum when the cameras 
have only 90° field of view but are able to “see” up to 25m. With the same resolution, 
configuration [b] is optimum if the cameras have a 360° field of view (Omni-Camera). 
If the resolution is lower so that cameras can “see” only up to 10m, configuration [c] is 
optimum. 

Fig. 5 illustrates the effect of different optimization criteria. With the other assump- 
tions the same as above, configuration [a] was found to be optimum when the worst case 
analysis was utilized [Eq. 15]. On the other hand, a uniform object density assumption 
[Eq. 16] yielded configuration [b] as the optimum one. When an assumption of variable 
object densities was utilized such that the density is highest near the door and decreases 
linearly with the distance from it [d], configuration [c] was found to be the best. Note 
that a higher object density near the door leads to a repositioning of the cameras such 
that they can better capture this region. 
Fig. 4. Illustration of the effect of different camera specifications. With a uniform density
assumption, the optimum configuration when the cameras have (a): field of view of 90° and resolution
up to 25m, (b): 360° field of view (Omni-Camera), and resolution up to 25m, (c): 360° field of 
view, but resolution only up to 10m. 
Fig. 5. Illustration of the effect of different optimization criteria. Optimum configuration for: (a):
worst-case analysis [Eq. 15], (b): uniform density case [Eq. 16], (c): variable density case [Eq. 
16] for the object density shown in (d). 
Fig. 6. (a) Plan view of a room used for a real experiment. (b) and (c) are the views from the
optimum camera locations when there is no panel (obstacle). Note that, of the three people in the 
scene, one person is occluded in each view. However, all of them are visible from at least one of the 
views. Image (d) shows the view from the second camera in the presence of the panel. Now, one 
person is not visible in any view. To improve visibility, the second camera is moved to (180, 600). 
The view from this new location is shown in (e), where all people are visible again. 



5.2 Analysis of a Real Scene 

We now present analysis of sensor placement for a real office room. The structure of 
the room is illustrated in Fig. 6 (a). We used the following parameters: uniform density
$\lambda = 0.25\,\mathrm{m}^{-2}$, object height = 170 cm, $r = 23$ cm, $h = 40$ cm, and $\alpha_{max} = 60^\circ$. The
cameras available to us had a field of view of 45° and needed to be mounted on the ceiling,
which is 2.5 m high. In order to view people’s faces as they enter the room, we further
restricted the cameras to be placed on the wall facing the door. We first consider the case 
when there is no panel (separator). If only one camera is available, the best placement 
was found to be at location (600,600) at an angle of 135° (measured clockwise from 
the positive x-axis). If two cameras are available, the best configuration consists of one 
camera at (0,600) at an angle of 67.5° and the other camera at (600, 600) at an angle of 
132°. Figures 6 (b) and (c) show the views from the cameras. 

Next, we place a thin panel at location (300, 300) - (600, 300). The optimum con- 
figuration of two cameras consists of a camera at (0,600) at an angle of 67.5° (same as 
before) and the other camera at (180, 600) at an angle of 88°. Figures 6 (d) & (e) show 
the views from the original and new location of the second camera. 

6 Conclusion 

We have presented two methods for evaluation of visibility given a certain configuration 
of sensors in a scene. The first one evaluates the visibility probabilistically assuming 
a density function for the occluding objects. The second method evaluates worst-case 
scenarios and is able to provide conditions that would guarantee visibility regardless of 
object configuration. Apart from obtaining important performance characterization of
multi-sensor systems, such analysis was further used for sensor planning by optimization 
of an appropriate cost function. The algorithm was tested on several synthetic and real 
scenes, and in many cases, the configurations obtained were quite interesting and non- 
intuitive. The method has applications in surveillance and can be utilized for sensor 
planning in places like museums, shopping malls, subway stations and parking lots. 
Future work includes specification of more complex cost functions, investigation of 
more efficient methods for optimization of the cost function and better estimation of 
visibility probability by considering the effect of interaction between objects. 
Abstract. In this paper, a new camera calibration algorithm is proposed,
which is based on the quasi-affine invariance of two parallel circles.
Two parallel circles here mean two circles in one plane, or in two parallel
planes. They are quite common in everyday scenes.

Between two parallel circles and their images under a perspective
projection, we set up a quasi-affine invariance. In particular, if their images
under a perspective projection are separate, we find an interesting
distribution of the images and the virtual intersections of the images,
and prove that it is a quasi-affine invariance.

The quasi-affine invariance is very useful: it is applied to identify the
images of the circular points. After the images of the circular points are
identified, linear equations on the intrinsic parameters are established, from
which a camera calibration algorithm is proposed. We perform both
simulated and real experiments to verify it. The results validate this method
and show its accuracy and robustness. Compared with the methods in
the literature, the advantages of this calibration method are: it uses
the minimal number of parallel circles; it is simple by virtue of the
proposed quasi-affine invariance; it does not need any matching.
Besides its application to camera calibration, the proposed quasi-affine
invariance can also be used to remove the ambiguity in recovering
the geometry of single axis motions by the conic fitting method in [8] and
[9]. In these two papers, three conics are needed to remove the ambiguity
of the method, whereas two conics are enough to remove it if the
two conics are separate and the quasi-affine invariance proposed here is
taken into account.



1 Introduction 

Camera calibration is an important task in computer vision whose aim is to
estimate the camera parameters. Usually, camera self-calibration techniques without
prior knowledge of the camera parameters are nonlinear [4], [13], [15]. Calibration can be
linearized if some scene information is taken into account during the process.
Therefore, many calibration methods using scene constraints have
appeared [2], [3], [5], [10], [11], [12], [14], [18], [19], [20], [23], [24],
[25]. Usually, the scene information used is parallelism, orthogonality,
known angles, circles and their centers, concentric conics, etc. For example,
in [14], the images of the circular points are determined when there is a circle
with several diameters in the scene, and then linear constraints on the intrinsic
parameters are set up. In [2], constraints on the projection matrix are given
by using the parallel and orthogonal properties of the scene. In [20],
parallelepipeds with some known angles and length ratios of the sides are assumed
to exist in the scene, and equations on the intrinsic parameters are established
from them. [25] presents a calibration method using one-dimensional objects.

Our idea in this paper is also to use scene information to find
constraints on the intrinsic parameters of cameras. Two circles in one plane or in two
parallel planes, called two parallel circles, are assumed to be in the scene, and
a quasi-affine invariance of them is found. Based on the invariance, camera
calibration is investigated, and a new algorithm is proposed. Compared with the
previous methods, this method has the following advantages: it uses the minimal
number of parallel circles; it is simple by virtue of the proposed quasi-affine
invariance; it does not need any matching.

Parallel circles are quite common in everyday scenes, so this calibration
method is widely applicable. The invariance can also be used to resolve the ambiguity
in recovering the geometry of single axis motions in [8], [9]. These two papers have shown
that the geometry of single axis motion can be recovered given at least two
conic loci consisting of corresponding image points over multiple views. If the
two conics are separate or enclosing, the recovery has a two-fold ambiguity, which
is removed in those papers by using three conics. In fact, if the two
conics are separate, the ambiguity can be removed from the two conics alone
by taking into account the quasi-affine invariance presented in this paper.

On the other hand, in [16], Quan gave the invariants of two space conics.
When the two conics are parallel circles, these invariants cannot be set up, but the
quasi-affine invariance proposed in this paper does exist. Since the imaging
process of a pinhole camera is quasi-affine [6], [7], the proposed quasi-affine
invariance is very useful.

The paper is organized as follows. Section 2 gives some preliminaries. Section
3 uses a quasi-affine invariance of two parallel circles to establish the equations
on the camera intrinsic parameters, and gives a linear algorithm for calibrating
a camera from these equations. Then, the invariance and algorithm are
validated by both simulated and real experiments in Section 4. Conclusions and
acknowledgements are given in Sections 5 and 6 respectively.

2 Preliminaries 

In this paper, “$\approx$” denotes equality up to a scale, a capital bold letter
denotes a matrix or 3D homogeneous coordinates, and a small bold letter denotes
2D homogeneous coordinates.

Definition 1. If two circles in space are in one plane, or in two parallel planes 
respectively, we call them two parallel circles. 
Fig. 1. Parallel circles. $C_1$ and $C_3$ are coplanar, $C_2$ is in the plane parallel to the plane
containing $C_1$ and $C_3$. Any two of them are two parallel circles.



As shown in Fig. 1, any two of $C_1$, $C_2$, $C_3$ are parallel circles.

Under a pinhole camera, a point $\mathbf{X}$ in space is projected to a point $\mathbf{x}$ in the
image by:

$$\mathbf{x} \approx \mathbf{K}[\mathbf{R}, \mathbf{t}]\mathbf{X}, \tag{1}$$

where $\mathbf{K}$ is the 3 × 3 matrix of camera intrinsic parameters, $\mathbf{R}$ is a 3 × 3 rotation
matrix, and $\mathbf{t}$ is a 3D translation vector. The goal of calibrating a camera is to find
$\mathbf{K}$ from images.

The absolute conic consists of points $\mathbf{X} = (X_1, X_2, X_3, 0)$ at infinity such
that:

$$X_1^2 + X_2^2 + X_3^2 = 0, \quad \text{or} \quad \mathbf{X}^T\mathbf{X} = 0,$$

and its image $\omega$ is:

$$\mathbf{x}^T \mathbf{K}^{-T}\mathbf{K}^{-1}\mathbf{x} = 0. \tag{2}$$

If some points on $\omega$ can be inferred from the image, equations on the intrinsic
parameters can be set up by (2). If the number of these equations is sufficient,
the intrinsic parameters can be determined. In the following, we find
points on $\omega$ by using two parallel circles in the scene.
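Equation (2) can be checked numerically on a synthetic camera. The sketch below assumes an upper-triangular K with made-up entries, a rotation built from two hypothetical angles, and the plane Z = 0, whose circular points (1, ±i, 0, 0) project to K(r1 ± i r2), where r1 and r2 are the first two columns of R. It evaluates the left side of (2) as (K⁻¹x)ᵀ(K⁻¹x) using back-substitution.

```python
import math


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def rotation(phi, theta):
    """R = Rz(phi) * Rx(theta); the angles are arbitrary test values."""
    cz, sz, cx, sx = math.cos(phi), math.sin(phi), math.cos(theta), math.sin(theta)
    Rz = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]
    Rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    return matmul(Rz, Rx)


def solve_upper(K, x):
    """Solve K y = x for an upper-triangular K by back-substitution."""
    y = [0j, 0j, 0j]
    for i in (2, 1, 0):
        y[i] = (x[i] - sum(K[i][k] * y[k] for k in range(i + 1, 3))) / K[i][i]
    return y


def iac_residual(K, x):
    """Left side of Eq. (2): x^T K^{-T} K^{-1} x (bilinear, no conjugation)."""
    y = solve_upper(K, x)
    return sum(yi * yi for yi in y)


def image_of_circular_point(K, R):
    """Image of the circular point (1, i, 0, 0) of the plane Z = 0."""
    v = [R[i][0] + 1j * R[i][1] for i in range(3)]
    return matvec(K, v)


if __name__ == "__main__":
    K = [[800.0, 0.5, 320.0], [0.0, 820.0, 240.0], [0.0, 0.0, 1.0]]
    x = image_of_circular_point(K, rotation(0.3, 0.7))
    print(abs(iac_residual(K, x)))  # should be close to zero
```

The residual vanishes because the columns of a rotation matrix are orthonormal, so (r1 + i r2)ᵀ(r1 + i r2) = 0; a real image point gives a strictly positive value.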

Some preliminaries on projective geometry are needed; the reader can refer
to [17] for details. Every real plane other than the plane at infinity, denoted
by $P$, intersects the plane at infinity in a real line, called the line at infinity
of $P$, denoted by $L_0$. $L_0$ intersects the absolute conic at a pair of conjugate
complex points, called the circular points of $P$. Every circle in $P$ passes through
the circular points of $P$. Let $C_1$ and $C_2$ be two parallel circles, and $P_1$ and $P_2$ be
the parallel planes containing them. Because $P_1$ and $P_2$ have the same line at
infinity, they have the same pair of circular points. Therefore, $C_1$ and $C_2$ pass
through the same pair of circular points, and they and the absolute conic form a
coaxial conic system at the two circular points (a coaxial conic system means a
set of conics through two fixed points).

A quasi-affine transformation lies part way between a projective and an affine
transformation; it preserves the convex hull of a set of points, and the relative
positions of some points and lines in a plane, or the relative positions of some
points and planes in 3D space. For the details, see [7] or Chapter 20 in [6].



3 New Calibration Method from the Quasi-affine
Invariance of Two Parallel Circles



Under a pinhole camera, a circle is projected to a conic. Moreover, because $\mathbf{K}$,
$\mathbf{R}$, $\mathbf{t}$ in (1) are real, a real point is projected to a real point, and a pair of
conjugate complex points is projected to a pair of conjugate complex points. So the images of
a pair of circular points must still be a pair of conjugate complex points.

If there are three or more parallel circles in the scene, then from
their images, the images of a pair of circular points can be uniquely determined
without any ambiguity by solving for the intersection points of the three image
conics [8], [21]. If there are only two in the scene, denoted by $C_1$, $C_2$, whose
images are denoted by $c_1$, $c_2$, whether the images of a pair of circular points can
be uniquely determined or not depends on the relative positions of $c_1$ and $c_2$.
The equations for $c_1$, $c_2$ are two quadratic equations, and the number of their common
solutions over the complex field is four, counted with multiplicity. If there are real solutions
among these four, i.e. $c_1$ and $c_2$ have real intersections, then there is a unique
pair of conjugate complex points among these four solutions, which must be the images
of the pair of circular points. If $c_1$ and $c_2$ have no real intersection, these four
solutions are two pairs of conjugate complex points. Which pair is the images of the
circular points? We discuss this in the following.




Fig. 2. Two cases in which $c_1$ and $c_2$ have no real intersection: the left side is the separate
case; the right side is the enclosing case.



If the relative positions of the camera and circles in the scene are general,
or the circles lie entirely in front of the camera, the images of these circles are
ellipses. From now on, we always assume that the circles in the scene are entirely in
front of the camera. Then, when $c_1$ and $c_2$ have no real intersection, there are
two cases as shown in Fig. 2: in one case $c_1$ and $c_2$ are separate; in the other
$c_1$ and $c_2$ are enclosing. For the enclosing case, we cannot distinguish the images
of the circular points between the two pairs of conjugate complex intersections
of $c_1$ and $c_2$ [21]. For the separate case, however, we can distinguish them by a
quasi-affine invariance.
Firstly, a lemma with respect to two coplanar circles is needed. In order to
distinguish the notation for two coplanar circles from the above notation $C_1$ and
$C_2$ for two parallel circles, we denote two coplanar circles as $C_1$ and $C_3$.

Lemma 1. If $C_1$ and $C_3$ are two coplanar separate circles, their homogeneous
equations have two pairs of conjugate complex common solutions. Connecting
the two points in each pair of conjugate complex common solutions, we
obtain two real lines, called the associated lines of $C_1$ and $C_3$. One of the two
associated lines lies between $C_1$ and $C_3$, and the other one, which is the line at
infinity passing through the circular points, does not lie between $C_1$ and $C_3$, as
shown in Fig. 3.

The proof of Lemma 1 is given in the Appendix.
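Lemma 1 can be illustrated with a concrete pair of circles. The sketch below is not the proof from the Appendix; it assumes, without loss of generality, that $C_1$ is centered at the origin with radius r1 and $C_3$ is centered at (d, 0) with radius r3, where d > r1 + r3. Subtracting the two circle equations gives the finite associated line x = (d² + r1² − r3²)/(2d), on which the two circles meet in a conjugate complex pair. The function name is hypothetical.

```python
import cmath


def finite_associated_line(r1: float, r3: float, d: float):
    """Finite associated line of two separate circles and the conjugate
    complex intersection points (x, +y) and (x, -y) lying on it.

    C1: x^2 + y^2 = r1^2,  C3: (x - d)^2 + y^2 = r3^2,  with d > r1 + r3.
    """
    if d <= r1 + r3:
        raise ValueError("circles are not separate")
    x = (d * d + r1 * r1 - r3 * r3) / (2.0 * d)
    y = cmath.sqrt(r1 * r1 - x * x)  # purely imaginary for separate circles
    return x, (complex(x), y), (complex(x), -y)


if __name__ == "__main__":
    x, p_plus, p_minus = finite_associated_line(1.0, 2.0, 5.0)
    # The line x = const lies strictly between the circles:
    # to the right of C1 (x > r1) and to the left of C3 (x < d - r3).
    print(x, 1.0 < x < 5.0 - 2.0, p_plus, p_minus)
```

The other pair of common points is the pair of circular points (1, ±i, 0), which lies on the line at infinity and therefore not between the circles.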




Fig. 3. Two coplanar separate circles $C_1$, $C_3$, and their associated lines $L_1$ and $L_0$
(the line at infinity). $C_1$ and $C_3$ intersect at two pairs of conjugate complex points:
one pair is on the line $L_1$; the other pair, which is the pair of circular points, is on the
line at infinity $L_0$. $L_1$ lies between $C_1$ and $C_3$, while $L_0$ does not.



Theorem 1. If $c_1$, $c_2$ are the images of two parallel circles and are separate, their
homogeneous equations have two pairs of conjugate complex common solutions.
Connecting the two points in each pair of conjugate complex common solutions,
we obtain two real lines, called the associated lines of $c_1$ and $c_2$. One
of the two associated lines lies between $c_1$ and $c_2$, and the other one does not
lie between $c_1$ and $c_2$, as shown in Fig. 4. If the camera optical center does not
lie between the two parallel planes containing the two circles, the associated line
not lying between $c_1$ and $c_2$ is the vanishing line through the images of the circular
points. Otherwise, the associated line lying between $c_1$ and $c_2$ is the vanishing
line through the images of the circular points.
Fig. 4. The distributions of $c_1$, $c_2$, and their two associated lines $l_1$ and $l_0$. $c_1$ and $c_2$
intersect at two pairs of conjugate complex points: one pair is on the line $l_1$; the other
pair is on the line $l_0$. $l_1$ lies between $c_1$ and $c_2$, while $l_0$ does not. The images of the
circular points are the pair on $l_0$ if the optical center does not lie between the two
parallel planes containing the two circles. Otherwise, they are the pair on $l_1$.



Proof. Let the two parallel circles be $C_1$, $C_2$, let $P_1$, $P_2$ be the planes containing
them, and let $O$ be the camera optical center. Because $P_1$, $P_2$ are parallel, the quadric
cone with $O$ as its vertex and passing through $C_2$ intersects the plane $P_1$ in a
circle, denoted by $C_3$. $C_1$ and $C_3$ are two coplanar circles in $P_1$. Because $c_1$ and
$c_2$ are separate, and are also the images of $C_1$ and $C_3$, we know that $C_1$ and $C_3$
are separate too (see Fig. 5). By Lemma 1, one of the associated
lines of $C_1$ and $C_3$ lies between $C_1$ and $C_3$ (denoted by $L$), and the other one,
i.e. the line at infinity passing through the circular points, does not lie between
$C_1$ and $C_3$ (denoted by $L_0$).

If $O$ does not lie between $P_1$ and $P_2$, then $C_1$ and $C_3$ in $P_1$ are both in
front of the camera. Because under a pinhole camera the imaging process from
the part of $P_1$ in front of the camera to the image plane is quasi-affine ([7],
Chapter 20 in [6]), the relative positions of $c_1$, $c_2$ and their associated lines are
the same as those of $C_1$, $C_3$, $L$, $L_0$. So the associated line not lying between
$c_1$ and $c_2$ is the image of $L_0$, i.e. the vanishing line through the images of the
circular points.

If $O$ lies between $P_1$ and $P_2$, the plane through $O$ and $L_0$, denoted by $P_0$,
which is parallel to $P_1$ and $P_2$, lies between $C_1$ and $C_2$. And the plane through
$O$ and $L$, denoted by $P$, does not lie between $C_1$ and $C_2$. This is because $C_3$
and $C_1$ lie on different sides of $P$, and $C_3$ and $C_2$ also lie on different
sides of $P$. The projection from $C_1$, $C_2$, $P_0$, $P$ to their images is quasi-affine, so
$c_1$, $c_2$ and their associated lines have the same relative positions as
$C_1$, $C_2$, $P_0$, $P$ (the image of $P_0$ is the vanishing line lying between $c_1$ and $c_2$, and
the image of $P$ is the associated line not lying between $c_1$ and $c_2$).

Then, the theorem is proved. 
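As a numerical illustration of the associated lines in Lemma 1 and Theorem 1, the following Python sketch computes the two real lines for a pair of conics given as symmetric 3x3 matrices. It does not solve for the four complex common points explicitly; instead it uses a standard projective-geometry construction (an assumption of this sketch, not a step stated in the text): the real line pair is a degenerate member of the pencil $c_1 - \mu c_2$, which is then split into its two lines. The function names are illustrative, and `lies_between` assumes both conics are ellipses, so that each has a finite center.

```python
import numpy as np

def skew(p):
    """Cross-product matrix [p]_x."""
    return np.array([[0, -p[2], p[1]], [p[2], 0, -p[0]], [-p[1], p[0], 0]])

def adjugate_sym(D):
    """Adjugate of a symmetric 3x3 matrix via cross products of its rows."""
    r0, r1, r2 = D
    return np.array([np.cross(r1, r2), np.cross(r2, r0), np.cross(r0, r1)])

def associated_lines(c1, c2, tol=1e-9):
    """Two real lines joining the conjugate pairs of common points of c1, c2."""
    for mu in np.linalg.eigvals(np.linalg.solve(c2, c1)):
        if abs(mu.imag) > tol * max(1.0, abs(mu)):
            continue                      # complex root of det(c1 - mu*c2) = 0
        D = c1 - mu.real * c2             # degenerate conic of the pencil
        D = D / np.linalg.norm(D)
        B = -adjugate_sym(D)              # equals p p^T for a real line pair
        i = int(np.argmax(np.diag(B)))
        if B[i, i] <= tol:
            continue                      # pair of complex conjugate lines
        p = B[:, i] / np.sqrt(B[i, i])    # intersection point of the two lines
        M = D + skew(p)                   # rank one: outer product of the lines
        r, c = np.unravel_index(np.argmax(np.abs(M)), M.shape)
        la, lb = M[r, :], M[:, c]
        return la / np.linalg.norm(la), lb / np.linalg.norm(lb)
    raise ValueError("no real line pair found")

def conic_center(c):
    """Center of an ellipse given by its 3x3 matrix, as a homogeneous point."""
    xy = -np.linalg.solve(c[:2, :2], c[:2, 2])
    return np.array([xy[0], xy[1], 1.0])

def lies_between(line, c1, c2):
    """True if the two ellipse centers are on opposite sides of the line."""
    return float(line @ conic_center(c1)) * float(line @ conic_center(c2)) < 0
```

Because an associated line meets neither conic in a real point, each ellipse lies entirely on one side of it, so comparing the signs at the two centers is enough to decide whether the line lies between them.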

Therefore, if $c_1$ and $c_2$ are separate, by Theorem 1 we can find the
images of a pair of circular points. If $c_1$ and $c_2$ are enclosing, and their two
pairs of complex conjugate intersections do not coincide, we cannot find the
images of the circular points (if the two pairs of complex conjugate intersections
coincide in one pair, the coinciding pair is the images of the circular points, and in
that case $C_1$, $C_3$ are concentric) [21].

Fig. 5. The camera and two parallel circles $C_1$, $C_2$. $O$ is the camera optical center. The
quadric cone passing through $O$ (as the vertex) and $C_2$ intersects the plane containing
$C_1$ in another circle $C_3$. $C_1$ and $C_3$ are coplanar. If the images of $C_1$ and $C_2$ are
separate, $C_1$ and $C_3$ are separate too, because the image of $C_3$ is the same as the
image of $C_2$

In fact, the enclosing case of $c_1$ and $c_2$ seldom occurs, while the other
cases of $c_1$ and $c_2$ occur quite often in practice. In what follows, we assume that
$c_1$ and $c_2$ are not enclosing.

By the discussion in the second paragraph of this section and Theorem 1, the
images of a pair of circular points can always be determined from a single view
of two parallel circles. Denoting the images of the determined circular points
by $m_I$, $m_J$, by (2) we have two linear equations on the camera intrinsic
parameters $\omega = K^{-T} K^{-1}$:

\[ m_I^T \omega\, m_I = 0, \qquad m_J^T \omega\, m_J = 0. \tag{3} \]

If the camera intrinsic parameters are kept unchanged and the motions between 
cameras are not pure translations, then from three views, six linear equations 
on the intrinsic parameters can be set up. Thus, the camera can be calibrated 
completely. 
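The linear system built from (3) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, each view is assumed to contribute one complex 3-vector $m_I$ (with $m_J$ its conjugate, so the real and imaginary parts of one equation give the two real constraints), and the column scaling before the SVD is an added numerical safeguard that is not part of the text. At least three views with non-parallel circle planes are assumed.

```python
import numpy as np

def omega_row(m):
    """Coefficients of m^T omega m in the unknowns
    (w11, w12, w22, w13, w23, w33) of the symmetric matrix omega."""
    m1, m2, m3 = m
    return np.array([m1 * m1, 2 * m1 * m2, m2 * m2,
                     2 * m1 * m3, 2 * m2 * m3, m3 * m3])

def calibrate_from_circular_points(points):
    """points: one complex 3-vector m_I per view (three views or more)."""
    rows = []
    for m in points:
        m = np.asarray(m, dtype=complex)
        m = m / np.linalg.norm(m)          # homogeneous scale is arbitrary
        r = omega_row(m)
        rows.extend([r.real, r.imag])      # two real equations per view
    A = np.array(rows)
    scale = np.linalg.norm(A, axis=0)      # column scaling (added safeguard)
    _, _, vt = np.linalg.svd(A / scale)
    w = vt[-1] / scale                     # null vector of A, up to scale
    omega = np.array([[w[0], w[1], w[3]],
                      [w[1], w[2], w[4]],
                      [w[3], w[4], w[5]]])
    if omega[2, 2] < 0:                    # fix the sign so omega is positive definite
        omega = -omega
    L = np.linalg.cholesky(omega)          # omega = L L^T, L proportional to K^{-T}
    K = np.linalg.inv(L).T
    return K / K[2, 2]
```

With noisy data the estimated $\omega$ may fail to be positive definite, in which case `np.linalg.cholesky` raises an error; handling that case is outside this sketch.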

An outline of our algorithm to calibrate a camera from the images of two
parallel circles is as follows.

Step 1. In each view, extract the pixels $u$ of the images of the two parallel circles,
and fit them with $u^T c_1 u = 0$, $u^T c_2 u = 0$ by the least
squares method to obtain $c_1$ and $c_2$; this gives the two conic equations
$e_1 : x^T c_1 x = 0$ and $e_2 : x^T c_2 x = 0$.

Step 2. Solve for the common solutions of $e_1$, $e_2$ in each view.

Step 3. Find the images of the circular points among the common
solutions of $e_1$, $e_2$ by the method presented in this section.

Step 4. Set up the equations on $\omega = K^{-T} K^{-1}$ from the
images of the circular points found in Step 3 by (3).

Step 5. Solve for $\omega$ from the equations in Step 4 by singular value decomposition,
then apply Cholesky decomposition and invert the result, or
use the equations in [22], to obtain the intrinsic parameters $K$.
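Step 1 above can be sketched as an algebraic least-squares fit. The text only says "least squares"; minimizing the algebraic residual under a unit-norm constraint on the six conic coefficients is one common choice and is an assumption here, as is the function name `fit_conic`.

```python
import numpy as np

def fit_conic(points):
    """Fit a conic to 2-D points; returns a symmetric 3x3 matrix c such that
    u^T c u is approximately 0 for every u = (x, y, 1)."""
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # One row per pixel: a x^2 + b x y + c y^2 + d x + e y + f = 0.
    D = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    _, _, vt = np.linalg.svd(D)
    a, b, c, d, e, f = vt[-1]        # minimizes |D p| subject to |p| = 1
    return np.array([[a, b / 2, d / 2],
                     [b / 2, c, e / 2],
                     [d / 2, e / 2, f]])
```

For pixel coordinates in the hundreds, normalizing the points (for example, shifting to the centroid and rescaling) before the fit usually improves conditioning; that refinement is omitted from the sketch.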

Remark 1. With the notation $O$, $L$, $L_0$ as in the proof of Theorem 1, let $P$ be
the plane through $O$ and $L$, and $P_0$ be the plane through $O$ and $L_0$. By the proof
of Theorem 1, the relative positions of $C_1$, $C_2$, $P$, $P_0$ are the
same as those of their images, which is exactly a quasi-affine invariance. For
the other cases, except for the enclosing case, the images of the circular points are found
by the real projective invariance that preserves the real and complex conjugate
intersections of conics respectively. Of course, the projective invariance is also a
quasi-affine invariance.

Remark 2. In Theorem 1, there are two cases: in one, the optical center
does not lie between the two parallel planes $P_1$, $P_2$ containing the two circles;
in the other, the optical center lies between $P_1$ and $P_2$. In general, the
former case occurs more often than the latter. When the two circles are
coplanar, the optical center never lies between $P_1$ and $P_2$, so the associated
line of $c_1$ and $c_2$ not lying between $c_1$ and $c_2$ is always the vanishing line.

Remark 3. If we use the above method with a calibration grid to calibrate a camera
in the same way as Zhang's method [24], it might be wise to take two intersecting
coplanar circles.



4 Experiments 



4.1 Simulated Experiments 

In the experiments, the simulated camera has the following intrinsic parameters: 



\[ K = \begin{bmatrix} 1500 & 3 & 512 \\ 0 & 1400 & 384 \\ 0 & 0 & 1 \end{bmatrix} \]



Take two parallel circles in the world coordinate system as: $X^2 + Y^2 = 6^2$,
$Z = 0$; $(X - 20)^2 + Y^2 = 3^2$, $Z = 10$. And take three groups of rotation
axes, rotation angles and translations as: $r_1 = (17, 50, 40)^T$, $\theta_1 = 0.3\pi$,
$t_1 = (-5, 15, 50)^T$; $r_2 = (-50, 50, 160)^T$, $\theta_2 = 0.1\pi$, $t_2 = (10, -4, 40)^T$;
$r_3 = (90, -70, 20)^T$, $\theta_3 = 0.2\pi$, $t_3 = (5, 2, 30)^T$. Let $R_i$ be the rotation
given by $r_i$, $\theta_i$.
Then project the two circles to the simulated image planes by the three projection
matrices $P_i = K[R_i, t_i]$, $i = 1, 2, 3$, respectively. The images of the two circles are
all separate (in order to verify Theorem 1), and the image sizes are 700 x 900,
550 x 950, 500 x 850 pixels respectively. Gaussian noise with mean 0 and standard
deviation ranging from 0 to 2.0 pixels is added to the image points of the two
circles, and then the intrinsic parameters are computed. For each noise level, we
perform 50 independent experiments, and the averaged results are shown in
Table 1. We also compute the root mean square errors (RMS errors) of the intrinsic
parameters under different noise levels; the results are given in Table 2.

Table 1. The averages of the estimated intrinsic parameters under different noise levels

Noise level (pixel)   f_u         f_v         s        u_0        v_0
0                     1500.0000   1400.0000   3.0000   511.9999   384.0000
0.4                   1500.4522   1400.4622   3.0655   513.0418   384.7355
0.8                   1500.3927   1399.8069   2.6278   518.8032   389.0022
1.2                   1500.9347   1399.9088   2.8893   525.2412   392.4559
1.6                   1501.8536   1399.4255   3.2259   537.6676   399.6847
2.0                   1503.9747   1399.7978   2.4428   548.5837   410.4403

Table 2. The RMS errors of the estimated intrinsic parameters under different noise
levels

Noise level (pixel)   f_u       f_v       s        u_0       v_0
0                      0.0000    0.0000   0.0000    0.0000    0.0000
0.4                    5.1775    4.7679   0.8985    5.1834    5.2046
0.8                   11.1786   10.2244   1.8713   11.2057   11.2147
1.2                   15.6606   14.3364   2.7034   15.6643   15.9711
1.6                   21.1434   19.9497   3.0504   21.4699   21.1630
2.0                   26.6784   24.8587   4.8918   27.6128   27.0722

Fig. 6. The standard deviations of the estimated intrinsic parameters vs. (a) the number
of images; (b) the distance of the two parallel circles (defined to be the distance of
the two parallel planes containing the two circles)
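The simulated setup described above (circles in planes $Z$ = const, rotations from axis and angle, projection by $K[R, t]$) can be reproduced along the following lines. This is a sketch under stated assumptions: the rotation is built with the Rodrigues formula from the normalized axis, the image conic is obtained from the plane-to-image homography rather than by projecting and refitting points, and all function names are illustrative.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation by `angle` about the direction `axis`."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(angle) * Kx + (1 - np.cos(angle)) * (Kx @ Kx)

def circle_conic(x0, y0, r):
    """Matrix of the circle (X - x0)^2 + (Y - y0)^2 = r^2 in plane coordinates."""
    return np.array([[1.0, 0.0, -x0],
                     [0.0, 1.0, -y0],
                     [-x0, -y0, x0 * x0 + y0 * y0 - r * r]])

def image_conic(K, R, t, circle, z):
    """Image of the circle (x0, y0, r) lying in the world plane Z = z."""
    # Homography from plane coordinates (X, Y, 1) to the image.
    H = K @ np.column_stack([R[:, 0], R[:, 1], z * R[:, 2] + t])
    Hinv = np.linalg.inv(H)
    return Hinv.T @ circle_conic(*circle) @ Hinv
```

For example, `image_conic(K, rodrigues((17, 50, 40), 0.3 * np.pi), np.array([-5.0, 15.0, 50.0]), (20.0, 0.0, 3.0), 10.0)` gives the image of the second circle in the first view. Noisy image points, as used in the experiments, would be generated by projecting sampled circle points and adding Gaussian noise before fitting.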

In order to assess the performance of our calibration technique with respect to the
number of images and to the distance between the two parallel circles, calibrations
using 4, 5, 6, 7 images and with varying distance between the two circles are performed
respectively, and the standard deviations of the estimated intrinsic parameters
are shown in Fig. 6, where the added noise level is 0.5 pixels. It is clear that
the deviations tend to decrease as the number of images increases. Let
$d$ be the distance between the two parallel planes containing the two circles (the two
circles used are $X^2 + Y^2 = 10^2$, $Z = 0$ and $(X - 40)^2 + Y^2 = 10^2$, $Z = d$, with $d$
varying from 0 to 240). Then we can see that: (i) the deviations for $f_u$, $f_v$,
$u_0$, $v_0$ tend to decrease as $d$ increases from 0 to 50, and then to increase as
$d$ increases from 50 to 240; (ii) the deviation for $s$ tends to increase with
$d$. It follows that the algorithm is not at its most stable when the two circles
are coplanar.




Fig. 7. The three images of two parallel circles used in the experiment



4.2 Real Experiments 

We use a CCD camera to take three photos of two cups as shown in Fig. 7. The
photos are of 1024 x 768 pixels. In each photo, the pixels of the images of the
upper circles at the brims of the two cups are extracted, then fitted by the least
squares method to obtain two conic equations (see Step 1 of our algorithm). From
Fig. 7, we can see that the extracted conics are separate in each view. Applying
Theorem 1 and the proposed calibration algorithm to these conic equations, the
estimated intrinsic parameter matrix is:

\[ K_1 = \begin{bmatrix} 1409.3835 & 8.0417 & 568.2194 \\ 0 & 1385.3772 & 349.3042 \\ 0 & 0 & 1 \end{bmatrix} \]



To verify $K_1$, the classical calibration grid DLT method in [1] is used to
calibrate the same camera (the intrinsic parameters are kept unchanged). The
image used is the left one in Fig. 8, and the calibration result from 72 corresponding
pairs of space and image points is:

\[ K_2 = \begin{bmatrix} 1325.6124 & 4.5399 & 500.7259 \\ 0 & 1321.2270 & 368.4573 \\ 0 & 0 & 1 \end{bmatrix} \]

Fig. 8. The images of a calibration grid used in the experiment



The estimated intrinsic parameters $K_1$, $K_2$ are used to reconstruct the calibration
grid from the two images in Fig. 8. The angles between two reconstructed
orthogonal planes are:

89.28° by using $K_1$, 89.97° by using $K_2$.

Both of them are close to the ground truth of 90°. Consider the reconstructed
vertical parallel lines on the calibration grid, and compute the angles between
any two of them; the averages are:

0.0000476° by using $K_1$; 0.0000395° by using $K_2$.

Both of them are close to the ground truth of 0°. These results validate the
algorithm proposed in this paper.



5 Conclusions 

We presented a quasi-affine invariance of two parallel circles, then applied it
to calibrating a camera. Both simulated and real experiments were given, and
showed the accuracy and robustness of this method. The presented quasi-affine
invariance is quite interesting and useful. It can also be applied to recovering the
geometry of single axis motions by the conic fitting method. We believe that it will
have more applications in the future.
Appendix: Proof of Lemma 1

Since $C_1$, $C_3$ are coplanar and separate, we can set up a Euclidean coordinate
system as follows: one of the centers of $C_1$ and $C_3$ is the origin $O$, the line through the
two centers is the X-axis, the line through $O$ and orthogonal to the X-axis is
the Y-axis, and the radius of one of the circles is the unit length. For example, we
take the coordinate system as in Fig. 3. Then the homogeneous equations of $C_1$
and $C_3$ are respectively:

\[ X^2 + Y^2 = Z^2, \qquad (X - X_0 Z)^2 + Y^2 = R^2 Z^2 \tag{4} \]

where $R$ is the radius of $C_3$ and $X_0$ is the horizontal coordinate of the center of $C_3$.
Because $C_1$ and $C_3$ are separate, $X_0 > 1 + R$. Solving for the common solutions of (4)
and computing the associated lines, we obtain:

\[ L_1 : X = \frac{X_0^2 - R^2 + 1}{2X_0}\, Z, \qquad L_0 : Z = 0 \ \text{(the line at infinity)} \]

Because $X_0 > 1 + R$, we can prove that the following inequality holds:

\[ 1 < \frac{X_0^2 - R^2 + 1}{2X_0} < X_0 - R < \infty \]

From the inequality, we know that $C_1$, $C_3$ lie on different sides of $L_1$. Since
$L_0$ is at infinity, $C_1$, $C_3$ must lie on the same side of it, as shown in Fig. 3.
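The intermediate steps are not spelled out in the text; a sketch of how they can be filled in from (4) alone is as follows. Subtracting the second equation of (4) from the first eliminates $X^2$ and $Y^2$:

\[
\begin{aligned}
\left(X^2 + Y^2 - Z^2\right) - \left[(X - X_0 Z)^2 + Y^2 - R^2 Z^2\right] &= 0 \\
2X_0 XZ - \left(X_0^2 - R^2 + 1\right) Z^2 &= 0 \\
Z\left[\,2X_0 X - \left(X_0^2 - R^2 + 1\right) Z\,\right] &= 0
\end{aligned}
\]

so every common solution lies either on $Z = 0$ or on the line $2X_0 X = (X_0^2 - R^2 + 1) Z$. For the inequality, with $X_0 > 1 + R$ and $R > 0$:

\[
\begin{aligned}
\frac{X_0^2 - R^2 + 1}{2X_0} > 1 \;&\Longleftrightarrow\; (X_0 - 1)^2 > R^2 \;\Longleftrightarrow\; X_0 - 1 > R, \\
\frac{X_0^2 - R^2 + 1}{2X_0} < X_0 - R \;&\Longleftrightarrow\; (X_0 - R)^2 > 1 \;\Longleftrightarrow\; X_0 - R > 1.
\end{aligned}
\]

Since the unit circle occupies $-1 \le X \le 1$ and the other circle occupies $X_0 - R \le X \le X_0 + R$ (with $Z = 1$), the vertical line at $X = (X_0^2 - R^2 + 1)/(2X_0)$ separates the two circles.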

In addition, the above proof is independent of the chosen Euclidean coordinate
system. This is because if we set up another Euclidean coordinate system
(with the same or a different unit length), the two systems are related
by a Euclidean transformation or a similarity transformation, which
preserves the line at infinity and the relative positions of objects.
Abstract. We study the problem of object, in particular face, recognition under
varying imaging conditions. Objects are represented using local characteristic 
features called textons. Appearance variations due to changing conditions are 
encoded by the correlations between the textons. We propose two solutions to 
model these correlations. The first one assumes locational independence. We call 
it the conditional texton distribution model. The second captures the second order 
variations across locations using Fisher linear discriminant analysis. We call it 
the Fisher texton model. Our two models are effective in the problem of face 
recognition from a single image across a wide range of illuminations, poses, and 
time. 



1 Introduction 

Recognition under varying imaging conditions is a very important, yet challenging prob- 
lem. Imaging conditions can change due to external and internal factors. External factors 
include illumination conditions (back-lit vs. front-lit or overcast vs. direct sunlight) and 
camera poses (frontal view vs. side view). Internal variations can arise from time (natural 
material weathers or rusts, or people aging) or internal states (facial expressions or a 
landscape changing appearance according to the season). The changes an object exhibits 
under varying imaging conditions are usually referred to as within-class variations in 
pattern recognition. 

The ability to be invariant to within-class variations determines how successful an 
algorithm will be in practical applications. In recent years, a lot of attention in the 
research community has been devoted to this problem. Some representative examples 
and their application domains are (1) generic 3-D objects [11]; (2) faces [1,2,3,6,7]; and
(3) natural materials [4,10,14]. 

In this paper, we strive to develop algorithms to recognize objects, in particular faces, 
under varying imaging conditions. The fundamental observation comes from human 
vision. After seeing many objects under different conditions, humans build an implicit 
internal model of how objects change their appearance. Using this model, humans can 
hallucinate any object’s appearance under novel conditions. For example, one can easily 
recognize a person from the side after seeing only a single frontal picture of this person. 
Or, one can still recognize a friend with ease after not seeing him for 10 years. Of 
course, recognition is not always perfect, especially under some unusual conditions, but 
the accuracy is significant. 

We adopt a learning framework to build a model of how the appearance of objects
changes under different imaging conditions. We call it the texton correlation model.
Textons are a discrete set of representative local features for the objects. The basic idea
is to encode efficiently how the textons transform when illumination, camera pose, etc.
change. Taking into account these transformations, we can build a similarity measure
between images which is insensitive to imaging conditions. Using the texton correlation
model, our algorithms can recognize faces from a single image of a person under a wide
range of illuminations and poses, and also after many years of aging.



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3021, pp. 203-214, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

The outline of this paper is as follows. The concept of textons is reviewed in Section 2.
Assuming locational independence, we propose in Section 3 a solution to capture the
within-class variations using the conditional texton distribution. Experimental results
using the conditional texton distribution are presented in Section 4. In Section 5, we
introduce the idea of Fisher textons to capture second-order correlations across both
pixel locations and imaging conditions. Results using Fisher textons are also presented.
Finally, we conclude and discuss future work in Section 6.



2 Textons 

Julesz [9] first proposed to use the term texton to describe the putative units of preatten- 
tive human texture perception. Julesz’s textons — orientation elements, crossings, and 
terminators — lack a precise definition for gray level images. Recently, the concept of 
textons has been re-invented and operationalized. Leung and Malik [10] define textons
as learned co-occurrences of outputs of linear oriented Gaussian derivative filters.
Variations of this concept have been applied to the problem of 3D texture recognition
[4,10,14]. We adopt a similar definition of textons in this paper.

What textons encode is a discrete set of local characteristic features of a 3D surface 
in the image space. This discrete set is referred to as the vocabulary. Every location on 
the image is mapped to an element in this vocabulary. For example, if the 3D surface is a 
human face, one texton may encode the appearance of an eye, another the mouth corner. 
For natural materials such as concrete, the textons may encode the image characteristic 
of a bar, a ridge, or a shadow edge. The textons can be learned from a single class (e.g. 
John, or concrete), thus forming a class-specific vocabulary. It can also be learned from 
a collection of classes (e.g. {John, Mary, Peter, . . .}, or {concrete, velvet, plaster, . . .}), 
thus forming a universal vocabulary. One advantage of the discrete nature of the texton 
representation is the ability to characterize changes in the image due to variations in 
imaging conditions easily. For example, when a person changes from a smiling expres- 
sion to a frown, the mouth corner may change from texton element I to element J. Or 
when the illumination moves from a frontal direction to an oblique angle, the element 
on a concrete surface may transform from texton element A to element B. The main 
focus of this paper is to study how to represent this texton element transformation to 
recognize objects and materials under varying imaging conditions. 

In this paper, textons are computed in the following manner. First, the image is 
filtered with a filterbank of linear Gaussian derivative filters. Specifically, the filters are 
the horizontal and vertical derivatives of circular symmetric Gaussian filters: 



\[ F_V(\sigma) = \frac{\partial}{\partial x} G_\sigma(x, y), \qquad F_H(\sigma) = \frac{\partial}{\partial y} G_\sigma(x, y), \]

\[ G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right). \]

Four different scales are used, giving a total of 8 filters. The particular choice of filters is not
very important (see footnote 1). This set is selected for its simplicity and ease of computation.
In fact, the filters are x-y separable and filtering can be done in $O(N)$ time instead of $O(N^2)$
time, where $N$ is the dimension of the kernel. After filtering, each pixel is transformed
into a vector of filter responses of length 8. These filter responses are clustered using
the K-means algorithm [5] to produce $K$ prototypical features for the objects. With this
vocabulary, every pixel in an image is mapped to the closest texton element according
to the Euclidean distance in the filter output space. We call the output of this process the
texton labels for an image. The value at each pixel is now between 1 and $K$, depending
on which texton best describes the local surface characteristics.
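The texton computation described above (separable Gaussian derivative filtering, K-means clustering of the 8-dimensional responses, nearest-center labeling) can be sketched as follows. Several details are assumptions of this sketch and are not given in the text: the four scale values, the kernel truncation at three standard deviations, zero padding at the image border, and the plain Lloyd iteration with random initialization. Labels here run from 0 to K-1 rather than 1 to K.

```python
import numpy as np

def gaussian_kernels(sigma):
    """1-D Gaussian g and its derivative dg, truncated at 3 sigma."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    return g, -x / sigma**2 * g

def conv_rows(img, k):
    """Convolve every row with k (image side must be >= kernel length)."""
    return np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)

def conv_cols(img, k):
    """Convolve every column with k."""
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, img)

def filter_responses(img, sigmas=(1.0, 2.0, 3.0, 4.0)):
    """Stack of x- and y-derivative responses, shape (H, W, 2 * len(sigmas))."""
    out = []
    for s in sigmas:
        g, dg = gaussian_kernels(s)
        out.append(conv_cols(conv_rows(img, dg), g))   # derivative along x
        out.append(conv_rows(conv_cols(img, dg), g))   # derivative along y
    return np.stack(out, axis=-1)

def assign(X, centers):
    """Index of the nearest center (Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns the k cluster centers (the vocabulary)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(X, centers)
        for j in range(k):
            if np.any(labels == j):                    # keep empty clusters as is
                centers[j] = X[labels == j].mean(axis=0)
    return centers
```

Given a vocabulary `centers`, the texton labels of an image are `assign(filter_responses(img).reshape(-1, 8), centers).reshape(img.shape)`.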



3 Conditional Texton Distribution 

We represent the texton transformation in a probabilistic formulation. The objective 
here is to capture how objects, faces, or natural materials change their appearance under 
varying imaging conditions. The goal is to learn the intrinsic transformation which is 
valid for all the instances within an object class. For example, in the context of face 
recognition, we would learn the transformation from a training set of a large group 
of people. The intrinsic variations within a single person and the differences between 
individuals are captured in the model. This learned transformation can be applied to any 
group of novel subjects and recognition can be achieved from a single image. 

Let $M$ be the image of a model. For example, in face recognition, $M$ will be an
image of the person you want to recognize. Let $I$ be an incoming image. The task is
to determine whether $I$ is the same object as $M$. Let $T_M$ be the texton labels for $M$
and $T_I$ be those of $I$. We define $P_{same}(T_I \mid T_M)$ to be the probability that $I$ is the same
object as the model $M$. Similarly, we define $P_{diff}(T_I \mid T_M)$ to be the probability that it
is a different object. The likelihood ratio can be used to determine whether they come
from the same object:

\[ L(T_I \mid T_M) = \frac{P_{same}(T_I \mid T_M)}{P_{diff}(T_I \mid T_M)} \tag{1} \]

The task is to define $P_{same}(T_I \mid T_M)$ and $P_{diff}(T_I \mid T_M)$. We make the simple
assumption that the texton labels at different locations are independent of each other:

\[ P_{same}(T_I \mid T_M) = \prod_x P_{same}(T_I(x) \mid T_M(x)) \]

\[ P_{diff}(T_I \mid T_M) = \prod_x P_{diff}(T_I(x) \mid T_M(x)) \]

1 Other filter choices can be found in [10,13,14]

The likelihood ratio can be used as a similarity measure between an image and the model.
We can either set a threshold on $L(T_I \mid T_M)$ to determine whether the face matches the
model, or, as in classification, assign the incoming image to the class with the highest
likelihood ratio score $L$.
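Under the independence assumption the likelihood ratio is a product over locations, so in practice it is convenient to work with its logarithm, which is a sum. The sketch below illustrates both uses named above, thresholding and classification. The array layout (`p_same[x, a, b]` holding $P_{same}(T_I(x) = a \mid T_M(x) = b)$), the zero-based labels, the default threshold and the function names are assumptions made for the illustration.

```python
import numpy as np

def log_likelihood_ratio(t_img, t_model, p_same, p_diff):
    """log L(T_I | T_M) for flattened texton label maps of equal length.
    p_same, p_diff: arrays of shape (n_pixels, K, K), indexed [x, a, b]."""
    x = np.arange(t_img.size)
    num = p_same[x, t_img, t_model]
    den = p_diff[x, t_img, t_model]
    return float(np.sum(np.log(num) - np.log(den)))

def verify(t_img, t_model, p_same, p_diff, threshold=0.0):
    """Accept the pair as the same object if log L exceeds the threshold."""
    return log_likelihood_ratio(t_img, t_model, p_same, p_diff) > threshold

def classify(t_img, models, p_same, p_diff):
    """Index of the model with the highest likelihood ratio."""
    scores = [log_likelihood_ratio(t_img, m, p_same, p_diff) for m in models]
    return int(np.argmax(scores))
```

The tables must contain no zero entries for the logarithms to be finite; how the tables are estimated is a separate question.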

3.1 Learning the Distribution from Data 

The discrete nature of textons allows us to represent the distributions $P(T_I(x) \mid T_M(x))$
exactly, without making simplifying assumptions such as a Gaussian distribution. Notice
that $T_I(x)$ is an element of the texton vocabulary and is scalar-valued: $T_I(x) \in
\{1, \ldots, K\}$. In fact, with a texton vocabulary of size $K$, $P(T_I(x) \mid T_M(x))$ can be
represented completely as a $K \times K$ table. This conditional probability table can be easily
learned from training data.

Let the training set be $\mathcal{T}$. Let $C_M$ be the set of all training data that belong to the
same class as $M$. Let $a, b \in \{1, \ldots, K\}$ be two texton elements in the vocabulary. The
entries in the probability table can be accumulated as follows (see footnote 2):

\[ P_{same}(T_I = a \mid T_M = b) = \frac{1}{Z_1} \sum_{M, I \in \mathcal{T}} 1_{(a,b,C_M)}(T_I, T_M, I) \]

\[ P_{diff}(T_I = a \mid T_M = b) = \frac{1}{Z_2} \sum_{M, I \in \mathcal{T}} 1_{(a,b,\bar{C}_M)}(T_I, T_M, I) \]

where $Z_1$ and $Z_2$ are normalizing constants that make $P_{same}$ and $P_{diff}$ probabilities. The
function $1_{(a,b,C_M)}(T_I, T_M, I) = 1$ if $T_I = a$, $T_M = b$, $I \in C_M$, and 0 otherwise;
$1_{(a,b,\bar{C}_M)}(T_I, T_M, I) = 1$ if $T_I = a$, $T_M = b$, $I \notin C_M$, and 0 otherwise.

Applying these two learned conditional probability tables to the likelihood ratio
$L(T_I \mid T_M)$ in Eq. 1, the similarity between a model and an incoming image can be
computed.
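The accumulation of the two tables can be sketched as a counting loop over ordered pairs of training images. The per-pixel tables follow the footnote on the implicit dependency on x; the additive smoothing constant `alpha` (which keeps every entry nonzero) and the normalization over $a$ for each fixed $b$ are assumptions of this sketch, as are the function and argument names.

```python
import numpy as np
from itertools import permutations

def learn_tables(texton_maps, identities, K, alpha=1.0):
    """texton_maps: int array (n_images, n_pixels) with labels in 0..K-1.
    identities: class label of each training image.
    Returns p_same, p_diff of shape (n_pixels, K, K), indexed [x, a, b]."""
    n_images, n_pixels = texton_maps.shape
    same = np.full((n_pixels, K, K), float(alpha))
    diff = np.full((n_pixels, K, K), float(alpha))
    px = np.arange(n_pixels)
    for m, i in permutations(range(n_images), 2):   # ordered pairs (M, I)
        table = same if identities[m] == identities[i] else diff
        # a = T_I(x), b = T_M(x); one count per pixel location.
        table[px, texton_maps[i], texton_maps[m]] += 1.0
    # Normalize over a so that each column b is a conditional distribution.
    same /= same.sum(axis=1, keepdims=True)
    diff /= diff.sum(axis=1, keepdims=True)
    return same, diff
```

The loop is quadratic in the number of training images; for large training sets the pairs could be subsampled, which the text does not discuss.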

4 Experiments 

In this section, we will describe results of applying the conditional texton distribution 
model to the problem of face recognition. Before images are compared, faces are au- 
tomatically extracted by a face detector [8]. Eyes and mouth corners are found using 
a similar algorithm to normalize each face for size and in-plane rotation. Each face is 
resampled to a standard size of 30 x 30 pixels. In all the experiments in this paper, 
separate texton vocabularies are learned independently at each pixel location. The main 
reason for this choice is the speed needed to compute texton assignment. For each pixel, 
a vocabulary size of 10 is used. In total, there are 30 x 30 x 10 = 9000 textons.

In all the experiments, the training set is used to obtain the texton vocabularies and 
the conditional texton distributions. All results are reported on a disjoint test set, in 
which none of the individuals appear in the training set. Results will be reported on two 
applications: face verification and face classification. First, we describe the databases 
used in this paper. 

2 The dependency on x is left implicit to keep the notation readable.

4.1 Database

Three face databases are used to evaluate the performance under different imaging
conditions:

Yale face database: Using a geodesic dome with 64 flashes, researchers at Yale Uni- 
versity collected a database of objects and human faces [2]. The set of face images 
is used in this paper to evaluate the performance of our algorithm towards vary- 
ing illuminations. There are 15 individuals. Images in which the illumination angle 
is within ±60° in elevation and azimuth are used. These 32 images are shown in 
Figure 1 . This database will be referred to as yale in this paper. 




Fig. 1 . The Yale face database. 32 illumination directions are used. The top two rows correspond to 
illumination directions randomly chosen to form the training set. The bottom two rows correspond 
to those in the test set. 



FERET: The FERET training database [12] consists of faces of 1203 individuals, each 
with several frontal images. For a subset of the individuals (670 people), non-frontal 
images are available. We break up this database into two sets to measure the per- 
formance of our algorithm separately. The Frontal FERET consists of faces with 
different expressions and slightly different natural lighting. Robustness of our al- 
gorithm towards out-of-plane rotations (up to ±45°) will be measured using the 
Rotated FERET subset. 

ID photos: This database consists of employee ID photos at a Japanese company. There 
are 836 individuals, with a total of 1834 images. All the employees are Asians, with 
the majority Japanese. Every individual has at least 2 images, one taken at the time of 
hire and another taken recently 3 . For each individual, there is a large age difference 
in the different photographs — usually several years, up to as many as 20 years. This 
is a challenging database because people can change their appearance significantly 
over the span of several years: from wearing glasses to wearing contact lenses, from 
skinny to fleshy, from having a lot of hair to bald, etc. This database will be able to 
test how our algorithm performs when people change their appearance when aging. 
Since these are ID photographs, lighting is well-controlled, though not constant
across all photographs. Expression is usually neutral. This database will be referred
to as ID in this paper. Due to privacy issues, only photos of 2 individuals can be
shown in Figure 2.

3 Some have one more photo taken during their employment.




Fig. 2. Examples of ID photos. Two different photographs of the same individual can be taken up 
to 20 years apart. This database can test our algorithm's performance against large age differences. 



4.2 Face Verification 

We first consider the problem of face verification. One application of face verification 
is an access control security system. A user inputs his/her identity through an ID card. 
A camera will capture an image of the user. The face will be detected and matched 
against the one previously stored in the system. If the match is good, the user will be 
granted access, otherwise denied. The basic concept is to compare two faces and decide 
whether it is the same person. In all our experiments, we randomly select two faces from 
the test set. These two faces can come from two different individuals or from the same 
individual (see footnote 4). Accuracy is measured by the false positive and false negative
rates, in the form of an ROC (receiver operating characteristic) curve. All experiments are repeated
using random partitions of the databases into training and test sets. Results reported are 
the average performance. 

We first study the performance of our algorithm with respect to illumination variations 
using the yale database. 10 subjects are used for training and the remaining 5 subjects for 
testing. 16 illumination directions are chosen randomly and the corresponding images are 
used as training. These illumination directions are the same for every training individual. 
The corresponding images for one person are shown in the top two rows in Figure 1 . The 
bottom two rows show the illumination directions for the test set. The training set and
the test set are completely disjoint. In other words, none of the illumination
directions in the test set is present in the training set for any individual. Our algorithm
performs perfectly in this experiment, getting 0% false positive and 0% false negative. 
Notice that the two images to be verified can have light directions up to 120° apart. This 
means the texton distribution does a very good job encoding the intrinsic changes in 
feature appearance under different illuminations. The learned distribution extrapolates 
well for both new individuals and novel illumination directions. 

4 For the case of the same individual, the two images are never identical.

The performance of our algorithm on the ID, Frontal FERET, and Rotated FERET
databases is shown in Figure 3. Figure 3(b) is a zoomed-in version of (a). The blue, red,
and green curves are the ROC curves for the Frontal FERET, Rotated FERET, and ID
databases respectively. The equal error rates (where the false positive rate equals the
false negative rate) for the three experiments are 4.2%, 10.5%, and 9% respectively.
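For reference, the ROC points and the equal error rate can be computed from a list of pair scores as in the sketch below. It assumes that a higher score means "more likely the same person" and that a pair is accepted when its score is at or above the threshold; the function names and the choice of reporting the average of the two rates at the closest crossing are assumptions of the illustration, not details given in the text.

```python
import numpy as np

def roc_points(scores, is_same):
    """False positive and false negative rates for every distinct threshold."""
    scores = np.asarray(scores, dtype=float)
    is_same = np.asarray(is_same, dtype=bool)
    # Start above the maximum score (reject all), then lower the threshold.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    fpr = np.array([np.mean(scores[~is_same] >= t) for t in thresholds])
    fnr = np.array([np.mean(scores[is_same] < t) for t in thresholds])
    return thresholds, fpr, fnr

def equal_error_rate(scores, is_same):
    """Error rate at the threshold where the two rates are closest."""
    _, fpr, fnr = roc_points(scores, is_same)
    i = int(np.argmin(np.abs(fpr - fnr)))
    return 0.5 * (fpr[i] + fnr[i])
```

With a finite number of test pairs the two rates may never be exactly equal, which is why the closest crossing is used.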

The dashed black curve in Figure 3 is the ROC curve on the Frontal FERET database
with a system trained using ID pictures. The idea is to see how well the learned texton
distributions generalize to a different database. Most machine learning algorithms
guarantee good generalization only if the statistics of the training and test sets are identical.
In this case, the statistics can be quite different; for example, the ethnic composition of
the subjects is totally different: ID contains pictures of predominantly Japanese subjects,
while Frontal FERET contains pictures of subjects with diverse ethnic backgrounds.



Face Verification Results Face Verification Results (zoomed-in) 




(a) 



(b) 



Fig. 3. Face verification results. The blue, red, and green curves are the ROC curves for the Frontal 
FERET, Rotated FERET, and ID databases respectively. For each of these three experiments, the 
training and test sets come from the same database. The dashed black curve indicates the ROC 
curve for the Frontal FERET database using a system trained with the ID photos. 



4.3 Face Classification 

In this section, we investigate the effectiveness of our conditional texton distribution 
model for the problem of face classification. Let there be P individuals, each with a single 
image as the model. For any new image of one of these P people, we want to automatically
determine who the person is. We use the similarity measure in Section 3 (Equation 1) and
assign the new image to the model with the highest similarity score.
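
The classification rule can be sketched as follows. The `similarity` argument stands in for the similarity measure of Equation 1, which is not reproduced here. The toy "images" (plain numbers) and the negative-distance similarity are hypothetical placeholders used only to make the sketch runnable.

```python
def classify(probe, models, similarity):
    """Return the identity whose single model image is most similar to the probe.

    models: dict mapping identity -> model image (any representation)
    similarity: callable(model, probe) -> float, higher means more similar
    """
    return max(models, key=lambda person: similarity(models[person], probe))


def error_rate(trials, similarity):
    """Fraction of (probe, true_identity, models) trials classified wrongly."""
    wrong = sum(classify(p, m, similarity) != who for p, who, m in trials)
    return wrong / len(trials)


# Hypothetical stand-ins: images are numbers, similarity is negative distance.
toy_similarity = lambda model, probe: -abs(model - probe)
models = {"A": 1.0, "B": 5.0}
trials = [(1.2, "A", models), (4.0, "B", models), (2.9, "B", models)]
err = error_rate(trials, toy_similarity)
```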

For all the databases, we randomly pick P individuals from the test set. For each 
person, one image is randomly selected to be the model, another to be the probe. The 
model and the probe can be taken under vastly different imaging conditions. This procedure is
repeated multiple times for different choices of P individuals and different random
training and test sets. Our algorithm works perfectly for the Yale database, giving 0%
error. The results for the other databases are reported in Figure 4. The blue, red, and green
curves report the error rates for the Frontal FERET, Rotated FERET, and ID databases
respectively. In these three cases, the training and test sets come from the same database.
We would like to emphasize that for all the experiments, only a single image is used
for the model. This is a very difficult problem, especially for the Rotated FERET case,
because the probe can be rotated up to 90° out of plane relative to the model. The black curve
in Figure 4 presents the error rate on Frontal FERET with a system trained using the ID
photos. This indicates the effectiveness of our algorithm when the statistics of the test set
(ethnically diverse) differ from those of the training set (predominantly Japanese).



[Figure 4, plot: Face Classification]




Fig. 4. Face classification results. The model consists of one image. The blue, red, and green curves 
represent the error rates for the Frontal FERET, Rotated FERET, and ID databases respectively.
The black curve represents the error rate for the Frontal FERET with a system trained using ID 
photos. 



5 Fisher Textons 

The conditional texton distribution model presented in Section 3 makes the assumption
that the texton assignments are independent of location. This assumption is clearly
wrong: for example, the appearances of the left eye and the right eye are definitely
correlated. However, the assumption enables us to compute the likelihood ratio
(Equation 1) efficiently.

In this section, we explore the correlation between locations on the face. Specifically, 
we take into account second order correlations. After texton assignment, every face is 
turned into a 30 x 30 array of texton labels. Each pixel, $T_i(x)$, takes a value in
{1, ..., K}, where K is the size of the texton vocabulary at each pixel.⁵ We transform each
pixel into an indicator vector of length K: [0, ..., 0, 1, 0, ..., 0], with the 1 at the k-th
element if $T_i(x) = k$. We concatenate all the vectors together, so that each image becomes a
30 x 30 x 10 = 9000 dimensional vector.
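
The indicator encoding can be sketched in a few lines of NumPy. The random label array is a hypothetical stand-in for a real texton assignment, and the function name is ours.

```python
import numpy as np

K = 10           # texton vocabulary size per pixel (K = 10 in the paper)
H, W = 30, 30    # size of the texton label array


def indicator_vector(labels, K=K):
    """Map an (H, W) array of texton labels in 1..K to a flat 0/1 vector."""
    labels = np.asarray(labels)
    one_hot = np.zeros(labels.shape + (K,), dtype=np.float64)
    rows, cols = np.indices(labels.shape)
    one_hot[rows, cols, labels - 1] = 1.0   # labels are 1-based
    return one_hot.reshape(-1)


rng = np.random.default_rng(0)
labels = rng.integers(1, K + 1, size=(H, W))   # hypothetical texton labels
x = indicator_vector(labels)                   # 30 * 30 * 10 = 9000 entries
```

Each pixel contributes exactly one non-zero entry, so the resulting vector is very sparse (900 ones out of 9000 entries).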

We perform Fisher linear discriminant analysis [5] on these vectors to obtain the
projection directions that best separate faces of different people. Specifically,
let the within-class scatter matrix be

$$S_w = \sum_{i=1}^{c} \sum_{x \in C_i} (x - m_i)(x - m_i)^T,$$

where c is the number of classes, $C_i$ is the set of training examples belonging to class
i, and $m_i = \frac{1}{n_i} \sum_{x \in C_i} x$ is the mean of class i, with $n_i = |C_i|$ and
$n = \sum_{i=1}^{c} n_i$. The between-class scatter matrix is defined as

$$S_b = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T,$$

where m is the total mean vector:

$$m = \frac{1}{n} \sum_{i=1}^{c} n_i m_i.$$

The objective is to find V that maximizes the following criterion function:

$$J(V) = \frac{|V^T S_b V|}{|V^T S_w V|}.$$

The columns of the optimal V are the generalized eigenvectors corresponding to the
largest eigenvalues in

$$S_b v_i = \lambda_i S_w v_i.$$

The vectors $v_i$ are the projection directions which capture the essential information to
classify objects among different classes. The idea is that when a large number of training
examples are used, the $v_i$ can distinguish between people in general, not only those
present in the training set.

We call these projection vectors $v_i$ the Fisher textons. Every incoming image is
transformed into an N dimensional vector by projecting it onto these Fisher textons. Similarity
between two faces is measured simply by the Euclidean distance between the projections.
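
A minimal NumPy sketch of this procedure is given below, using the standard Fisher discriminant definitions of the scatter matrices. The small ridge term `reg` added to the within-class scatter is our assumption (the 9000-dimensional scatter matrix is typically singular, and the paper does not say how this is handled). The function names and the toy data are hypothetical.

```python
import numpy as np


def fisher_directions(X, y, n_dirs, reg=1e-3):
    """Return the top n_dirs generalized eigenvectors of (S_b, S_w).

    X: (n, d) array of training vectors, y: (n,) integer class labels.
    reg: ridge term added to S_w (an assumption, to keep it invertible).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m = X.mean(axis=0)                      # total mean
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                # class mean
        D = Xc - mc
        Sw += D.T @ D                       # within-class scatter
        diff = (mc - m)[:, None]
        Sb += len(Xc) * (diff @ diff.T)     # between-class scatter
    Sw += reg * np.eye(d)
    # Whiten S_w, then solve an ordinary symmetric eigenproblem.
    evals_w, U = np.linalg.eigh(Sw)
    Wh = U / np.sqrt(evals_w)               # Wh.T @ Sw @ Wh = identity
    evals, E = np.linalg.eigh(Wh.T @ Sb @ Wh)
    order = np.argsort(evals)[::-1][:n_dirs]
    return Wh @ E[:, order]                 # columns solve S_b v = lambda S_w v


def fisher_distance(V, a, b):
    """Euclidean distance between the projections of a and b onto V."""
    return float(np.linalg.norm(V.T @ (np.asarray(a) - np.asarray(b))))


# Toy data: classes differ along the first axis, vary widely along the second.
X = np.array([[0.0, -5.0], [0.0, 5.0], [0.2, 0.0], [-0.2, 0.0],
              [1.0, -5.0], [1.0, 5.0], [1.2, 0.0], [0.8, 0.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
V = fisher_directions(X, y, n_dirs=1)
same = fisher_distance(V, X[0], X[1])    # same class, far apart in input space
diff = fisher_distance(V, X[2], X[6])    # different classes
```

In the toy example the learned direction ignores the high-variance second axis, so two samples of the same class end up close together even though they are far apart in the input space.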

5 K = 10 in this paper. 

[Figure 5, panels: Face Verification Results using Fisher Textons; Face Verification Results using Fisher Textons (zoomed-in)]




Fig. 5. Face verification results using Fisher textons. The blue, red, and green curves are the ROC 
curves for the Frontal FERET, Rotated FERET, and ID databases respectively. For each of these 
three experiments, the training and test sets come from the same database. The dashed black 
curve indicates the ROC curve for the Frontal FERET database using a system trained with the 
ID photos. 



Let us contrast the Fisher textons with the conditional texton distribution.
For Fisher textons, locational correlations are captured up to second order.
However, imaging condition correlations are also captured only up to second order.
On the other hand, the conditional texton distribution model sacrifices location
dependencies to capture the exact texton distributions under changing imaging conditions.

The results on the problem of face verification are shown in Figure 5. The blue,
red, and green curves are the ROC curves for the Frontal FERET, Rotated FERET, and
ID databases respectively. The equal error rates are 2%, 5.5%, and 6.6% respectively. The
dashed black curve indicates the ROC curve for the Frontal FERET database using a 
system trained with the ID photos, with an equal error rate of 9%. The performance on 
the face verification task is uniformly better than that produced by the conditional texton 
distribution model. The added locational dependencies more than offset the sacrifice 
made on the imaging condition correlations. 

Results for face classification using Fisher textons are shown in Figure 6. Compared
with the conditional texton distribution model, the performance is better for the Frontal
FERET (blue curve) and Rotated FERET (red curve) databases. However, it is worse for the
ID database (green curve) and for the Frontal FERET database when trained using ID photos
(black curve). One possible explanation is that the within-class variations in the ID
database are so large (because of the long timeline) that just capturing the second order
correlations is not enough to distinguish individuals well. Using the whole distribution,
as in the conditional texton distribution model, can thus produce better results.

[Figure 6, plot: Face Classification using Fisher Textons]




Fig. 6. Face classification results using Fisher Textons. The blue, red, and green curves represent 
the error rates for the Frontal FERET, Rotated FERET, and ID databases respectively. The black 
curve represents the error rate for the Frontal FERET with a system trained using ID photos. 



6 Conclusions 

In this paper, we study the problem of recognition under varying imaging conditions. 
Two algorithms are proposed based on the idea of textons. The first algorithm uses
conditional texton distributions to model the within-class and between-class variations
exactly, at the cost of assuming locational independence. The second algorithm, which we
call Fisher textons, captures second order correlations in both location and imaging
condition variations. Both algorithms are effective in the problems of face verification
and recognition.

Comparing with state-of-the-art algorithms is difficult without training and testing 
on the same datasets. Future work includes thorough comparisons with other algorithms. 
Another direction for future work is to develop algorithms to capture the exact depen- 
dencies from both pixel locations and changing imaging conditions. The obvious choice 
is a Markov Random Field. However, the parameters in MRFs are difficult to estimate 
without a large quantity of data. Inference is not a trivial task either. Finding efficient 
ways to capture the correlations will be an interesting problem from both theoretical and 
practical points of view. 

Abstract. We present a method for learning feature descriptors using
multiple images, motivated by the problems of mobile robot navigation
and localization. The technique uses the relative simplicity of
small baseline tracking in image sequences to develop descriptors suitable
for the more challenging task of wide baseline matching across significant
viewpoint changes. The variations in the appearance of each
feature are learned using kernel principal component analysis (KPCA)
over the course of image sequences. An approximate version of KPCA is
applied to reduce the computational complexity of the algorithms and
yield a compact representation. Our experiments demonstrate robustness
to wide appearance variations on non-planar surfaces, including changes
in illumination, viewpoint, scale, and geometry of the scene.



1 Introduction 

Many computer vision problems involve the determination and correspondence 
of distinctive regions of interest in images. In the area of robot navigation, a 
mobile platform can move through its environment while observing the world 
with a video camera. In order to determine its location, it must create a model 
that is rich enough to capture this information yet sparse enough to be stored 
and computed efficiently. By dealing with only sparse image statistics, called 
features, these algorithms can be made more efficient and robust to a number 
of environmental variations that might otherwise be confusing, such as lighting 
and occlusions. Usually, these features must be tracked across many images to 
integrate geometric information in space and time. Thus, one must be able to 
find correspondences among sets of features, leading to the idea of descriptors 
which provide distinctive signatures of distinct locations in space. By finding 
features and their associated descriptors, the correspondence problem can be 
addressed (or at least made simpler) by comparing feature descriptors. 

In applications such as N-view stereo or recognition, it is frequently the case 
that a sparse set of widely separated views are presented as input. For such wide 
baseline problems, it is necessary to develop descriptors that can be derived from 
a single view, since no assumptions can be made about the relative viewpoints 
among images. In contrast, in the cases of robot navigation or real-time structure 
from motion, a video stream is available, making it possible to exploit small baseline
correspondence by tracking feature locations between closely spaced images.
This provides the opportunity to incorporate multiple views of a feature into its 
signature, since that information is present and easily obtained. Under these cir- 
cumstances, the problem of tracking across frames is relatively simple, since the 
inter-frame motion is small and appearance changes are minimal. Many existing 
feature trackers, such as [12,35,37], can produce chains of correspondences by 
incrementally following small baseline changes between images in the sequence. 
These trackers, however, are far from ideal and often drift significantly within a
handful of frames of motion; thus they cannot be used for wide baseline matching.
In order to reduce this effect, features must be recognized despite a variety of ap- 
pearance changes. Additionally, when a sudden large change in viewpoint occurs, 
it is necessary to relate features that have already been observed to their coun- 
terparts in a new image, thus maintaining the consistency of the model that the 
features support. In this paper, we propose to integrate the techniques of short 
baseline tracking and wide baseline matching into a unified framework which can 
be applied to correspondence problems when video streams are available as input. 

One may question the necessity of such a multi-view descriptor, given that 
many single view features work well (see the next section for an overview). In 
some cases, in fact, multiple views will add little to the descriptor matching 
process, since the variability can be well modelled by a translational or affine 
transformation. When these assumptions are violated, such as when there are 
non-planar surfaces in a scene, or complex 3D geometry, multiple views of a feature
can provide significant robustness to viewpoint changes. When the geometry
itself is changing, such a descriptor is necessary to capture this variability. A
multiple view descriptor can provide a generic viewpoint, meaning the results of
matching will be less sensitive to particular viewpoints (which may be confusing
special cases). Perhaps the most compelling argument in favor of the multi-view
approach is that, in many applications, the data is already there. When this is 
the case, it makes sense to try to leverage the data available, rather than discard 
it after processing each frame. 



1.1 Related Work 

The problem of finding and representing distinctive image features for the pur- 
poses of tracking, reconstruction, and recognition is a long-standing one. Re- 
cently, a number of authors (Schmid et al [7,23], Lowe [21,20], Baumberg [1], 
Tuytelaars and Van Gool [38,39], Schaffalitzky and Zisserman [28,29]) have de- 
veloped affine invariant descriptors of image locations. These expand upon the 
pioneering work of Lindeberg [19], Koenderink [15], and others who study image 
scale space and its properties. The general approach of these methods is to find 
image locations which can be reliably detected by searching for extrema in the 
scale space of the image [19]. Given different images of a scene taken over small
or wide baselines, such descriptors can be extracted independently from each image,
then compared to find local point correspondences.

Lowe’s scale invariant feature transform (SIFT) [20] considers an isotropic
Gaussian scale space, searching for extrema over scale in a one-dimensional scale 
space. The difference-of-Gaussian function is convolved with the image, com- 
puted by taking the difference of adjacent Gaussian-smoothed images in the 
scale space pyramid. Once the extrema are located (by finding maxima and 
minima in neighborhoods of the scale space pyramid), they are filtered for sta- 
bility then assigned a canonical orientation, scale and a descriptor derived from 
gradient orientations in a local window. The descriptor is adapted from Edelman 
et al. [8], which models the outputs of “complex cells” that respond to particular
image gradients within their receptive fields. In a similar manner, SIFT samples 
gradients around the points of interest and amalgamates them into a 4x4 array 
of 8-bin orientation histograms. This constitutes the 128 element SIFT descrip- 
tor. SIFT has been shown to match reliably across a wide range of scales and 
orientation changes, as well as limited 3D perspective variation [24]. 

Mikolajczyk and Schmid’s affine invariant interest point detector [23] seeks 
to find stable areas of interest in the affine image scale space, which has three 
parameters of scale. It first selects initial feature locations using a multi-scale 
Harris corner detector [12] then applies an iterative technique to find the best 
location, scale, and shape transformation of the feature neighborhood. The pro- 
cedure converges to yield a point location in the image, as well as a canonical 
transformation which can be used to match the feature despite arbitrary affine 
transformations of its neighborhood. Descriptors consist of normalized Gaussian 
derivatives computed on patches around the points of interest. These patches 
are transformed by the canonical mapping used in the detection phase of the 
algorithm, which yields scale and skew invariance. Rotational invariance is ob- 
tained using steerable filters, and affine photometric invariance is achieved by 
normalizing all of the derivatives by the first. A dimension 12 descriptor is the 
final output of the procedure, involving derivatives up to 4th order. 

Tuytelaars and Van Gool [39] have developed methods which explicitly take 
into account a variety of image elements, such as corners, edges, and inten- 
sity. They find and characterize affinely invariant neighborhoods by exploiting 
properties of these elements, then match similar points using geometric and pho- 
tometric constraints to prune false matches. An explicit assumption they make 
is that the areas of interest selected lie on approximately planar regions, though 
their experiments demonstrate robustness to violations of this assumption. 

The aforementioned works (which represent only a small fraction of the latest 
literature on the topic, see also [1,7,9,5,12,14,15,20,21,22,23,26,27,28,29,34,36,38, 
39,40] for some others) focus on the problem of extracting properties from local 
patches in a single image in order to match these locations in subsequent images 
of the same scene. In [9], Ferrari et al present a method for matching features 
across multiple unordered views, which is derived from pairwise view matches 
using the detector described in [39]. When a video stream is available, as often 
is the case when using cameras on mobile robots, more information is present 
than can be obtained from a single image of a scene or an arbitrary set of 
such images. It is therefore reasonable to seek a multi-view descriptor, which
incorporates information from across multiple adjacent views of a scene to yield 
better feature correspondences. 


1.2 Rationale and Overview of the Approach

We address the wide-baseline correspondence problem under the specific scenario 
of autonomous guidance and navigation, where high frame-rate video is available 
during both training (“map building”) and localization, but viewing conditions 
can change significantly between the two. Such changes affect both the domain 
of the image (geometric distortion due to changes of the viewpoint and possibly 
deformations of the scene) and its range (changes in illumination, deviation from 
Lambertian reflection). If $w : \mathbb{R}^2 \to \mathbb{R}^2$ denotes a piecewise smooth function,
and $I : \Omega \subset \mathbb{R}^2 \to \mathbb{R}_+$ denotes the image, then in general two given images
are related by $I_2(x) = \rho(I_1(w(x)))$, where $\rho$ is a functional that describes range
deformations, and $w$ describes domain deformations. Such changes are due to
both intrinsic properties of the scene $\xi$ (shape, reflectance) and to nuisance
factors $\nu$ (illumination, viewpoint), so we write formally $\rho = \rho_{\xi,\nu}(I)$ and
$w = w_{\xi,\nu}(x)$. The goal of the correspondence process is to associate different images
to a common cause (the same underlying scene $\xi$) despite the nuisances $\nu$.

A “feature” is a statistic of the image, $\phi : I \to \mathbb{R}^k$, that is designed to facilitate
the correspondence process.¹ Ideally one would want a feature statistic that
is invariant with respect to all the nuisance factors: $\phi \circ \rho_{\xi,\nu}(I(w_{\xi,\nu}(x))) = f_\xi(x)$,
independent of $\nu$, for some function $f$ and for all allowable nuisances $\nu$. Unfortunately,
this is not possible in general, since there exists no single-view statistic
that is invariant with respect to viewpoint or lighting conditions, even for
Lambertian scenes. Nuisances that can be moded out in the representation are
called invertible.² What nuisance is invertible depends on the representation of
the data. If we consider the data to be a single image, for instance, viewpoint 
is not an invertible nuisance. However, if multiple adjacent views of the same 
scene are available, as for instance in a video from a moving camera, then view- 
point can be explicitly accounted for, at least in theory. Additionally, changes 
in viewpoint elicit irradiance changes that are due to the interplay of reflectance 
and illumination, and it is therefore possible that “insensitive” (if not invariant) 
features can be constructed. This is our rationale for designing feature descrip- 
tors that are based not on single views, but on multiple adjacent views of the 
same scene. 

Any correspondence process relies on an underlying model of the scene, 
whether this is stated explicitly or not: our model of the scene is a constellation 
of planar patches that support a radiance density which obeys a diffuse+specular 
reflection model. This means that for patches that are small enough one either 
sees the Lambertian albedo, or a reflection of the light source, and therefore 
the rank of an aggregation of corresponding views is limited [13]. We represent 

1 Facilitation should be quantified in terms of computational complexity, since the ben- 
efit of using features to establish correspondence is undermined by Rao-Blackwell’s 
theorem, that guarantees that a decision based on any statistic of the data achieves 
a conditional risk that is no less than the decision based on the raw data. 

2 Euclidean, affine, and projective image motion are invertible nuisances, in the sense 
that they can be eliminated by pre-processing the image. However, viewpoint is not 
invertible, unless the scene has special symmetries that are known a priori. 
the multi-view descriptor by a rank constraint in a suitable inner product of a
deformed version of the image: $\mathrm{rank}(T) = r$, where the tensor $T$ is defined by

$$T = \left( \Phi(I_1(w_1(x))),\ \Phi(I_2(w_2(x))),\ \ldots,\ \Phi(I_n(w_n(x))) \right) \tag{1}$$

for a suitable function $\Phi$ that maps the image to a higher-dimensional space.³
The modeling “responsibility” is shared by the transformation $w$ and the map
$\Phi$: the more elaborate the one, the simpler the other. What is not modeled explicitly
by $w$ and $\Phi$ goes to increase the rank of $T$. In this general modeling philosophy,
our approach is therefore broadly related to [6,10,11]. In this work, we consider
the following combinations:

Translational domain deformation, generic kernel: We use the simplest
possible $w(x) = x + T$ and a generic map $\Phi$, and all the responsibility for modeling
deformations of the domain and the range is relegated to the principal
components of the tensor $T$ in (1). In this case, the descriptor is invariant with
respect to plane translations, but all other deformations contribute to the
rank r ~ n.

Affine domain deformation, generic kernel: $w(x) = Ax + T$, and additional
geometric distortion due to changes of viewpoint and photometric variability
is relegated to the principal components of the tensor $T$.

Viewpoint deformation: In this case, $w(x)$ depends on the 3-D structure of
the scene, and is explicitly inverted by projective reconstruction and normalization.
Therefore, the tensor $T$ is viewpoint invariant and its principal components
model only photometric variability.



2 Proposed Solution 

We relegate deformations in the images not accounted for by the transformation
$w$ to the principal components of $T$. Principal component analysis (PCA)
operates on the premise that a low dimensional basis suffices to approximate
the covariance matrix of the samples, thereby providing a compact representation.
Given M observation images, PCA diagonalizes the covariance matrix
$C = \frac{1}{M} \sum_{j=1}^{M} y_j y_j^T$ (where $y_j$ can be considered a vectorized image patch, and
without loss of generality we assume that $y$ is pre-processed to be zero mean) by
solving an eigenvalue equation [2]. The Karhunen-Loeve (KL) transform is an efficient
method to compute the basis (principal components), which can be carried
out using singular value decomposition (SVD) [3]. For the case where we have a
stream of incoming data, the sequential Karhunen-Loeve algorithm exploits the 
low dimension approximation characteristic by partitioning and transforming the 
data into blocks of orthonormal columns to reduce computational and memory 

3 We use $\Phi$ to indicate the map from the data to what is known in the kernel-machine
community as “feature space” and $\phi$ for the feature that we have defined in this
section (i.e. an invariant image statistic). The notation should not be confusing in
the end, since $\phi$ will comprise principal components of $T$.
requirements [18,4]. In other words, this algorithm essentially avoids the computation
of a full-scale SVD at every time step by only applying the necessary
computation to smaller data blocks for updating the KL basis incrementally. 
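
As a minimal sketch of the batch (non-sequential) case, the principal components of a set of vectorized patches can be obtained from a single SVD of the centered data matrix. The function name and the random patches are hypothetical. The incremental block update of the sequential Karhunen-Loeve algorithm [18,4] is not reproduced here.

```python
import numpy as np


def pca_basis(patches, r):
    """Return the top-r principal components of vectorized patches.

    patches: (M, d) array, one vectorized image patch per row.
    The rows are centered so that the zero-mean assumption holds.
    """
    Y = np.asarray(patches, dtype=float)
    Y = Y - Y.mean(axis=0)
    # Rows of Vt are eigenvectors of C = (1/M) * Y^T Y.
    _, s, Vt = np.linalg.svd(Y, full_matrices=False)
    eigvals = s**2 / Y.shape[0]
    return Vt[:r].T, eigvals[:r]


rng = np.random.default_rng(0)
patches = rng.normal(size=(40, 16))       # hypothetical 4x4 patches
B, lam = pca_basis(patches, r=3)          # basis (16, 3) and its eigenvalues
```

The columns of `B` are orthonormal and diagonalize the sample covariance, which is the property the sequential algorithm maintains incrementally as new blocks of data arrive.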



2.1 Incremental Update of Kernel Principal Components 

Since PCA aims to find an optimal low dimensional linear subspace that min- 
imizes the reconstruction error, it does not perform well if the data points are 
generated from a nonlinear manifold. Furthermore, PCA encodes the data based 
on second order dependencies (pixel-wise covariance among the pixels), and ignores
higher-order statistics including nonlinear relations among the pixel intensity
values, such as the relationships among three or more pixels in an edge or a
curve, which can capture important information for recognition. 

The shortcomings of PCA can be overcome through the use of kernel functions
with a method called kernel principal component analysis (KPCA). In
contrast to conventional PCA which operates in the input image space, KPCA
performs the same procedure as PCA in a high dimensional space, $F$, related
to the input by the (nonlinear) map $\Phi : \mathbb{R}^N \to F$, $y \mapsto Y$.⁴ If one considers
$y \in \mathbb{R}^N$ to be a (vectorized) image patch, $Y \in F$ is this image patch mapped into
$F$. The covariance matrix for M vectors in $F$ is $C' = \frac{1}{M} \sum_{j=1}^{M} \Phi(y_j)\Phi(y_j)^T$,
assuming $\sum_{k=1}^{M} \Phi(y_k) = 0$ (see [31] for a method to center $\Phi(y)$). By diagonalizing
$C'$, a basis of kernel principal components (KPCs) is found. As demonstrated in
[30], by using an appropriate kernel function $k(x,y) = \langle \Phi(x), \Phi(y) \rangle$, $x, y \in \mathbb{R}^N$,
one can avoid computing the inner product in the high-dimensional space $F$.
The KPCs are implicitly represented in terms of the inputs (image patches) $y$,
the kernel $k$, and a set of linear coefficients $\alpha$, as $\Psi = \sum_{i=1}^{M} \alpha_i \Phi(y_i)$, $\Psi \in F$.

To choose an appropriate kernel function, one can either estimate it from 
data or select it a priori. In this work, we chose the Gaussian kernel 

$$k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right) \tag{2}$$

based on empirical study, though a principled yet computationally expensive 
method for learning the kernel from data using quadratic programming was 
recently demonstrated by Lanckriet et al. [17]. 
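
A minimal sketch of batch KPCA with the Gaussian kernel follows. The feature-space centering uses the usual double-centering of the kernel matrix; whether this coincides with the method of [31] is our assumption. The function names, the value of sigma, and the random patches are hypothetical.

```python
import numpy as np


def gaussian_kernel_matrix(A, B, sigma):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all rows of A and B."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def kpca(Y, sigma, n_components):
    """Kernel PCA coefficients for the rows of Y (one patch per row).

    Returns alpha of shape (n_components, M): component i is
    sum_j alpha[i, j] * Phi_centered(y_j), scaled to unit norm in F.
    """
    M = Y.shape[0]
    K = gaussian_kernel_matrix(Y, Y, sigma)
    J = np.eye(M) - np.ones((M, M)) / M
    Kc = J @ K @ J                       # center Phi(y) in feature space
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[order], evecs[:, order]
    alpha = (evecs / np.sqrt(evals)).T   # unit-norm components in F
    return alpha, Kc


rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 5))             # hypothetical vectorized patches
alpha, Kc = kpca(Y, sigma=2.0, n_components=3)
```

Only the M x M kernel matrix is ever formed, so the map into the high-dimensional space is never evaluated explicitly. The price is that all M input patches must be kept to evaluate the components on new data.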

Unfortunately, there is no “online” version of KPCA, as exists for standard 
PCA [4,18]. In order to avoid computations in the high-dimensional space, all 
computations are performed through the kernel in terms of linear combinations 
of input vectors. Hence, in traditional KPCA, all of the input vectors (image 
patches) must be stored in order to perform classification. This is unacceptable 
for feature descriptors, since the storage requirement is high, the computational 
complexity grows with more examples, and the representation is not compact. 

4 F is typically referred to as “feature space,” but to avoid confusion we will refrain 
from using that name. 

To avoid this problem, we have employed an online (i.e. constant time and
memory) version, approximate KPCA, which continually produces a fixed number
of approximations of the input patches. These approximations and their
expansion coefficients, along with the kernel function, form a compact representation
of the KPCA basis in high-dimensional space, which can be used to
compare an observed feature with this descriptor. The technique is based on that
of Schölkopf et al. [33] for finding approximate “pre-images” of vectors implicitly
represented in high-dimensional space. Given a vector $\Psi = \sum_{i=1}^{M} \alpha_i \Phi(y_i)$, we
seek an approximation $\Psi^* = \sum_{i=1}^{L} \beta_i \Phi(z_i)$, with $L < M$. For a particular pre-image
$z$, [33] demonstrates that it is sufficient to minimize the distance between
$\Psi$ and its projection onto $\Phi(z)$, which is equivalent to maximizing

$$\frac{\langle \Psi, \Phi(z) \rangle^2}{\langle \Phi(z), \Phi(z) \rangle}. \tag{3}$$

By differentiating and substituting the Gaussian kernel function for inner products,
the following fixed-point iteration for $z$ is obtained:

$$z^{n+1} = \frac{Y (\alpha \mathbin{.*} K^n)}{\alpha^T K^n}, \tag{4}$$

where $Y$ is a matrix of input vectors $y_i$ (image patches), $\alpha$ is the vector of
coefficients, $K^n$ is the vector of kernels $[k(y_1, z^n), \ldots, k(y_M, z^n)]^T$, and $.*$ is the
element-wise multiplication operator. In order to find a number of such approximations,
$z_1, \ldots, z_m$, we set $\Psi^{m+1} = \Psi^m - \beta_m \Phi(z_m)$, where $z_m$ is found using (4).
One can solve for the optimal coefficients of the expansion, $\beta$, in each iteration
to yield $\Psi^* = \sum_{i=1}^{L} \beta_i \Phi(z_i)$ [32].
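
The fixed-point iteration for a single pre-image can be sketched as follows, assuming the Gaussian kernel and the matrix form described above (columns of Y are the input vectors, z is updated to a kernel-weighted average of them). The starting point, stopping rule, and toy data are hypothetical choices. Finding several pre-images by deflation and solving for the coefficients of the reduced expansion are not shown.

```python
import numpy as np


def preimage_fixed_point(Y, alpha, sigma, z0, n_iter=100, tol=1e-10):
    """Iterate z <- Y (alpha .* K) / (alpha^T K) for a Gaussian kernel.

    Y: (d, M) matrix whose columns are the input vectors y_i.
    alpha: (M,) expansion coefficients of the vector being approximated.
    z0: (d,) starting point (a hypothetical choice, e.g. one of the y_i).
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iter):
        sq = np.sum((Y - z[:, None])**2, axis=0)
        Kz = np.exp(-sq / (2.0 * sigma**2))    # [k(y_1, z), ..., k(y_M, z)]
        denom = alpha @ Kz
        if abs(denom) < 1e-12:                 # update is undefined here
            break
        z_new = Y @ (alpha * Kz) / denom
        if np.linalg.norm(z_new - z) < tol:
            z = z_new
            break
        z = z_new
    return z


Y = np.array([[0.0, 1.0, 0.2],
              [0.0, 0.1, 1.0]])                # columns are y_1, y_2, y_3
alpha = np.array([0.5, 0.3, 0.2])
z = preimage_fixed_point(Y, alpha, sigma=1.0, z0=Y[:, 0])
```

When all coefficients are positive, each update is a convex combination of the input vectors, so the pre-image stays inside their convex hull. With mixed-sign coefficients the denominator can approach zero, which is why the sketch guards against it.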

In order to match a newly observed image to existing descriptors, our algorithm
searches the image for patches which have a small residual when projected
onto the stored KPCA descriptors. That is, it finds $y$, a patch from the new image,
and $\psi$, a descriptor (KPCA basis), such that the following is minimized over
the choice of $\psi$ and $y$:

$$\left\| \Phi(y) - \sum_{i=1}^{N} \frac{\langle \Phi(y), \psi_i \rangle}{\langle \psi_i, \psi_i \rangle} \psi_i \right\| \tag{5}$$

where $\psi_i$ is a kernel principal component and $N$ is the number of components
in the descriptor.
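
The residual can be evaluated entirely through the kernel. The sketch below assumes each component is stored as a linear combination of mapped patches, returns the squared norm (which has the same minimizer as the norm), and ignores feature-space centering for brevity. The function names, the storage layout, and the toy values are hypothetical.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel between two vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-np.sum(diff**2) / (2.0 * sigma**2)))


def projection_residual(y, Z, coeffs, sigma):
    """Squared residual of Phi(y) after projecting onto components psi_i.

    Z: (L, d) stored patches z_j; coeffs: (N, L) with
    psi_i = sum_j coeffs[i, j] * Phi(z_j).  Everything is evaluated
    through the kernel, so Phi is never formed explicitly.
    """
    Z = np.asarray(Z, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    L = Z.shape[0]
    k_y = np.array([gaussian_kernel(z, y, sigma) for z in Z])
    Kzz = np.array([[gaussian_kernel(Z[a], Z[b], sigma) for b in range(L)]
                    for a in range(L)])
    p = coeffs @ k_y                  # inner products <Phi(y), psi_i>
    G = coeffs @ Kzz @ coeffs.T       # inner products <psi_i, psi_l>
    c = p / np.diag(G)                # projection coefficients
    res2 = gaussian_kernel(y, y, sigma) - 2.0 * (c @ p) + c @ G @ c
    return max(float(res2), 0.0)


# Toy descriptor with a single component psi_1 = Phi(z_1).
Z = np.array([[0.0, 0.0], [3.0, 0.0]])
coeffs = np.array([[1.0, 0.0]])
near = projection_residual([0.0, 0.0], Z, coeffs, sigma=1.0)
far = projection_residual([10.0, 0.0], Z, coeffs, sigma=1.0)
```

A patch identical to the stored one gives a residual of zero, while a patch far from every stored patch gives a residual close to k(y, y) = 1, the largest value possible for the Gaussian kernel.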



2.2 Feature Descriptors through Incremental Update of KPCA 

Our method for extracting feature descriptors from image sequences proceeds as 
follows: 

1. Bootstrap with a small-baseline tracker: Read a number of frames of 
the input sequence, track the features using a standard tracking method, and 
store the image patches of each feature. As a translation-invariant tracker,
$w(x) = x + T$, we use Lucas and Kanade’s [37] classic algorithm; as an affine-invariant
tracker, $w(x) = Ax + T$, we use the Shi and Tomasi (ST) algorithm.

2. Construct kernel basis: Perform KPCA using the Gaussian kernel separately
on each feature’s training sequence or reduced training set found in
step 3.

3. Approximate kernel basis: Form an approximate basis for each feature 
by finding approximate patches which lead to the least residual estimate of 
the original basis in high-dimensional space. Create L such patches for each 
feature. In our algorithm, L is a tuning parameter. Further discussion of 
“pre-image” approximation in KPCA can be found in [32,33]. 

The above algorithm yields a set of descriptors, each corresponding to a 
particular feature. Given a novel image of the same scene, these descriptors can 
be matched to corresponding locations on the new image using (5). 

3 Experiments 

We performed a variety of experiments with the proposed system to test the effi- 
cacy of small and wide baseline correspondence. In the experiments, two phases 
were established: training and matching , which correspond to short and wide 
baseline correspondence. In the training phase, a video sequence was recorded of 
a non-planar object; this object underwent a combination of 3D rotation, scal- 
ing, and warping with respect to the camera. The Shi-Tomasi (ST) [35] tracker 
(or Lucas-Kanade (LK), in the case of translational tracking) was used to obtain 
an initial set of points, then the procedure of the previous section was used to 
track these locations and develop feature descriptors via approximate KPCA. 
Note that we do not show experiments for projective reconstruction, since we 
did not see any benefit to this approach using our data sets. In future experi- 
ments with more severe depth variations, we expect to see significant benefits 
by normalizing the projective reconstruction. 

In the matching phase, a test image from outside the training sequence was 
used to find wide-baseline correspondences. First, initial features were selected 
using the ST or LK selection mechanism. The purpose of this was to find the 
most promising locations based on the same criteria used in the tracking phase. 
In the case of affine tracking, the ST tracking algorithm then performed affine 
warping in a neighborhood around each candidate point, seeking to minimize 
the discrepancy between this point and each stored in the training phase. In 
the translational case, no such warping was applied. The quality of a candidate 
match was calculated by finding the projection distance of this patch onto the 
basis of the descriptor using (5). Finally, candidate matches that fell below a 
threshold distance were selected, and the best among those was chosen as the 
matching location on the test image. 

The results for matching are displayed in the figures using an image of the 
training sequence for reference, but the matching process does not use that 
image. Rather, it matches the descriptors derived from all of the video frames. 
Any image in the training sequence could be used for this visualization. 

It is important to note a few aspects of our experiments. First, we are only 
concerned with feature selection or tracking insofar as they influence the
experimental quality. Any consistent feature selector and tracker can be used to
estimate the candidate points, or even no feature tracker at all (the entire image 
or a neighborhood could be searched, for example). For computational reasons, 
we used the ST selector and tracker for the experiments, which proved robust 
enough for our purposes. 

A number of tuning parameters are present in the algorithms. For the LK 
and ST trackers, one must choose a threshold of point selection, which bounds 
the minimum of the eigenvalues of the second moment matrix of the image 
around the point. This was set, along with a minimum spacing between points 
of 10 pixels, such that about 50 points were selected on the object of interest. 
During tracking, a pyramid of scales is used to impose scale invariance. We 
chose to calculate three levels of this pyramid in all experiments, hence the 
pyramid included the image and downsampled versions 1/2 and 1/4 of its original 
size. For the KPCA algorithm, the tuning parameters are S, the size of the 
image patches, N, the number of kernel principal components to keep, L, the 
number of approximate patches to keep (to support the approximate kernel 
principal component basis), and σ, the bandwidth of the Gaussian kernel. We
used S = 31 × 31, N = 8 for the translational case, N = 3 for the affine case,
L = 4, and σ = 70. These were selected by experiment.
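
The point selection threshold bounds the smaller eigenvalue of the second moment matrix. A minimal sketch of that score follows, assuming central-difference gradients and an unweighted square window; `box_sum`, `min_eigenvalue_map` and the window radius are illustrative choices, not the settings used in the experiments.

```python
import numpy as np

def box_sum(a, r):
    """Sum of `a` over a (2r+1) x (2r+1) window around each pixel (zero padded)."""
    k = 2 * r + 1
    padded = np.pad(a, r, mode="constant")
    integral = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    integral = np.pad(integral, ((1, 0), (1, 0)), mode="constant")
    return (integral[k:, k:] - integral[:-k, k:]
            - integral[k:, :-k] + integral[:-k, :-k])

def min_eigenvalue_map(image, r=2):
    """Smaller eigenvalue of the windowed second moment matrix at every pixel."""
    iy, ix = np.gradient(image.astype(float))   # derivatives along rows, columns
    a = box_sum(ix * ix, r)
    b = box_sum(ix * iy, r)
    c = box_sum(iy * iy, r)
    # Closed form for the smaller eigenvalue of [[a, b], [b, c]].
    return 0.5 * (a + c) - np.sqrt(0.25 * (a - c) ** 2 + b ** 2)

# Toy image: a bright quadrant whose corner sits at (row 16, column 16).
img = np.zeros((32, 32))
img[16:, 16:] = 1.0
lam = min_eigenvalue_map(img, r=2)
```

The score is large at the corner, where the gradient varies in two directions, and vanishes both along a straight edge and in flat regions, which is why thresholding it selects trackable points.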

While we did not optimize our experiments for speed or programming effi- 
ciency, we found the average time for tracking between adjacent frames to be ap- 
proximately 10 seconds. The code was executed in Matlab on a 1.7GHz Pentium 
IV processor. The video frames and test images were 640x480 8-bit greyscale 
pixels, and about 50 feature locations were tracked every frame. The wide base- 
line matching had similar time requirements, taking up to twenty minutes to 
match 50 features, since a brute force combinatorial search was used. Optimiza- 
tions in compiled code, as well as search heuristics, would increase these speeds 
dramatically. We have developed a C++ version of the tracking phase of this
code, which runs at speeds of greater than 15Hz on the same hardware. 

The figures in the following pages show a selection of many experiments. 
In all figures, lines link corresponding points between views. The results were 
pruned of matches outside the relevant objects to make the figures less cluttered. 
When comparing the choice of tracker, we found that affine tracking required 
fewer principal components than translational tracking to produce similar cor- 
respondence rates. When attempting to match scenes that are rotated or scaled 
with respect to the training sequence, the affine tracking scheme has a clear
advantage, since these domain transformations are explicitly accounted for by the
tracking mechanism. Such variability could not be represented in the transla- 
tional case unless it was observed in the training sequence. 

4 Concluding Remarks 

We have presented a novel method for extracting feature descriptors from image 
sequences and matching these to new views of a scene. Rather than derive invari- 
ance completely from a model, our system learns the variability in the images 
Fig. 1. Affine tracking + KPCA: A non-planar surface undergoing warping and
viewpoint changes. (Top-left) The first image of the training sequence. (Top-right) 
The test image, outside of the training sequence. 27 feature locations were correctly 
matched, with 2 false positives. (Bottom-left) Image from the training sequence. 
(Bottom-right) The warped object. 40 locations correctly matched, 6 false positives. 



directly from data. This technique is applicable in situations where such data is 
already available, such as robot navigation, causal structure from motion, face 
tracking, and object recognition. 

There are a number of ways in which our system can be extended. Because a 
kernel technique is used, we must approximate the basis comprising the feature 
descriptor with virtual inputs. The best way to do this remains an open problem, 
and we are investigating other methods in addition to the one presented here ([16, 
25], for example). When tracking and matching, we use the ST selector to provide 
an initial guess for feature locations. While convenient, this may not be the best 
choice, and we are experimenting with the use of more modern feature selectors 
([20,23,28]). In cases where the observed scene is known to be rigid, robust 
structure from motion techniques (RANSAC or a robust Kalman Filter, for 
example) can be used to remove incorrect correspondences and suggest potential 
feature locations. Finally, the experimental code must be translated into more 
efficient form, allowing it to be used in near real-time on mobile platforms. 
Fig. 2. Translational, Affine, and SIFT: The figures show the results of matching
using our algorithms and SIFT on a rotating can. (Left) The translational Multi-View
technique correctly matches a number of features near the edges of the can. There 
are 22 correct correspondences and 8 false positives. (Center) The affine Multi-View 
matches 32 locations and zero outliers, but with fewer matches along the edges. (Right) 
SIFT correctly matches more locations overall, but toward the center portion of the 
can where the transformation is approximately affine. Some lines between matches are
included for clarity. 




(Panel titles: Obs image; One Training image)



Fig. 3. Translational + KPCA: Matching using a non-planar surface undergoing 
warping and scale changes. During training, the object was moved away from the 
camera and deformed. (Left) shows the first image of the training sequence. (Right) 
shows the test image, which is outside of the training sequence. The shapes and colors 
indicate the corresponding points. In this example, 50 feature locations were correctly 
matched, with 10 false positives. 
Abstract. In this paper we describe a novel technique for detecting
salient regions in an image. The detector is a generalization to affine 
invariance of the method introduced by Kadir and Brady [10]. The de- 
tector deems a region salient if it exhibits unpredictability in both its 
attributes and its spatial scale. 

The detector has significantly different properties to operators based on 
kernel convolution, and we examine three aspects of its behaviour: invari- 
ance to viewpoint change; insensitivity to image perturbations; and re- 
peatability under intra-class variation. Previous work has, on the whole, 
concentrated on viewpoint invariance. A second contribution of this pa- 
per is to propose a performance test for evaluating the two other aspects. 
We compare the performance of the saliency detector to other stan- 
dard detectors including an affine invariance interest point detector. It 
is demonstrated that the saliency detector has comparable viewpoint 
invariance performance, but superior insensitivity to perturbations and 
intra-class variation performance for images of certain object classes. 



1 Introduction 

The selection of a set of image regions forms the first step in many computer 
vision algorithms, for example for computing image correspondences [2,17,19,20, 
22], or for learning object categories [1,3,4,23]. Two key issues face the algorithm 
designer: the subset of the image selected for subsequent analysis and the repre- 
sentation of the subset. In this paper we concentrate on the first of these issues. 
The optimal choice for region selection depends on the application. However, 
there are three broad classes of image change under which good performance 
may be required: 

1. Global transformations. Features should be repeatable across the expected 
class of global image transformations. These include both geometric and pho- 
tometric transformations that arise due to changes in the imaging conditions. 
For example, region detection should be covariant with viewpoint as illustrated 
in Figure 1. In short, we require the segmentation to commute with viewpoint 
change. 

2. Local perturbations. Features should be insensitive to classes of semi-local 
image disturbances. For example, a feature responding to the eye of a human face 
should be unaffected by any motion of the mouth. A second class of disturbance is 



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3021, pp. 228-241, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 



An Affine Invariant Salient Region Detector 229 



(Diagram: Feature Detection applied to each of two views related by the transformation H.)



Fig. 1. Detected regions, illustrated by a centre point and boundary, should commute
with viewpoint change - here represented by the transformation H. 



where a region neighbours a foreground/background boundary. The detector can 
be required to detect the foreground region despite changes in the background. 
3. Intra-class variations. Features should capture corresponding object parts 
under intra-class variations in objects. For example, the headlight of a car for 
different brands of car (imaged from the same viewpoint). 

In this paper we make two contributions. First, in Section 2 we describe ex- 
tensions to the region detector developed by Kadir and Brady [10]. The exten- 
sions include covariance to affine transformations (the first of the requirements 
above), and an improved implementation which takes account of anti-aliasing. 
The performance of the affine covariant region detector is assessed in Section 3 
on standard test images, and compared to other state of the art detectors. 

The second contribution is in specifying a performance measure for the two 
other requirements above, namely tolerance to local image perturbations and to 
intra-class variation. This measure is described in Section 4 and, again, perfor- 
mance is compared against other standard region operators. 

Previous methods of region detection have largely concentrated on the first 
requirement. So-called corner features or interest points have had wide appli- 
cation for matching and recognition [7,21]. Recently, inspired by the pioneering 
work of Lindeberg [14], scale and affine adapted versions have been developed 
[2,18,19,20]. Such methods have proved to be robust to significant variations 
in viewpoint. However, they operate with relatively large support regions and 
are potentially susceptible to semi-local variations in the image; for example, 
movements of objects in a scene. They fail on criterion 2. 

Moreover, such methods adopt a relatively narrow definition of saliency and 
scale; scale is usually defined with respect to a convolution kernel (typically a 
Gaussian) and saliency to an extremum in filter response. While it is certainly 
the case that there are many useful image features that can be defined in such a 
manner, efforts to generalise such methods to capture a broader range of salient 
image regions have had limited success. 
Fig. 2. (a) Complex regions, such as the eye, exhibit unpredictable local intensity hence
high entropy. Image from NIST Special Database 18, Mugshot Identification Database. 
However, entropy is invariant to permutations of the local patch (b). 



Other methods have extracted affine covariant regions by analysing the image 
isocontours directly [17,22] in a manner akin to watershed segmentation. Related 
methods have been used previously to extract features from mammograms [13]. 
Such methods have the advantage that they do not rely on excessive smoothing 
of the image and hence capture precise object boundaries. Scale here is defined in 
terms of the image isocontours rather than with respect to a convolution kernel 
or sampling window. 

2 Information Theoretic Saliency 

In this section we describe the saliency region detector. First, we review the 
approach of Kadir and Brady [10], then in Section 2.2 we extend the method to 
be affine invariant, and give implementation details in Sections 2.3 and 2.4. 

2.1 Similarity Invariant Saliency 

The key principle underlying the Kadir and Brady approach [10] is that salient 
image regions exhibit unpredictability, or ‘surprise’, in their local attributes and 
over spatial scale. The method consists of three steps: I. Calculation of Shannon
entropy of local image attributes (e.g. intensity or colour) over a range of scales,
$H_D(s)$; II. Select scales at which the entropy over scale function exhibits a
peak, $s_p$; III. Calculate the magnitude change of the PDF as a function of
scale at each peak, $W_D(s)$. The final saliency is the product of $H_D(s)$ and
$W_D(s)$ at each peak. The histogram of pixel values within a circular window of
radius s is used as an estimate of the local PDF. Steps I and III measure the
feature-space and the inter-scale predictability respectively, while step II selects
optimal scales. We discuss each of these steps next.
Fig. 3. The two entropy peaks shown in (a) correspond to the centre (in blue) and
edge (in red) points in top image. Both peaks occur at similar magnitudes. 



The entropy of local attributes measures the predictability of a region with 
respect to an assumed model of simplicity. In the case of entropy of pixel in- 
tensities, the model of simplicity corresponds to a piecewise constant region. 
For example, in Figure 2(a), at the particular scales shown, the PDF of inten- 
sities in the cheek region is peaked. This indicates that most of these pixels are 
highly predictable, hence entropy is low. However, the PDF in the eye region is 
flatter which indicates that here, pixel values are highly unpredictable and this 
corresponds to high entropy. 
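
The contrast between a peaked and a flat local PDF can be reproduced on synthetic data. The sketch below estimates the PDF from a histogram of the pixels inside a circular window and evaluates its Shannon entropy in bits; the function name `local_entropy`, the 256-bin histogram and the two toy images are assumptions made for illustration.

```python
import numpy as np

def local_entropy(image, cx, cy, s, bins=256):
    """Shannon entropy (bits) of the intensities within radius s of (cx, cy)."""
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= s ** 2
    counts, _ = np.histogram(image[inside], bins=bins, range=(0, 256))
    p = counts[counts > 0] / counts.sum()        # empty bins contribute nothing
    return float(-(p * np.log2(p)).sum())

# A uniform region (peaked PDF) and a region split evenly between two intensities.
flat = np.full((32, 32), 128, dtype=np.uint8)
mixed = np.zeros((32, 32), dtype=np.uint8)
mixed[:, 16:] = 255
h_flat = local_entropy(flat, 15.5, 15.5, 8)      # single intensity present
h_mixed = local_entropy(mixed, 15.5, 15.5, 8)    # two intensities in equal proportion
```

In this toy setting the uniform region has zero entropy, while the evenly split region reaches one bit, the maximum for two intensity values.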

In step II, scales are selected at which the entropy is peaked. Through search- 
ing for such extrema, the feature-space saliency is locally optimised. Moreover, 
since entropy is maximised when the PDF is flat, i.e. all present attribute values 
are in equal proportion, such peaks typically occur at scales where the statistics 
of two (or more) different pixel populations contribute equally to the PDF esti- 
mate. Figure 3(b) shows entropy as a function of scale for two points in Figure 
3(a). The peaks in entropy occur at scales for which there are equal proportions 
of black and white pixels present. These significant, or salient scales, in the en- 
tropy function (analogous to the ‘critical-points’ in Gaussian scale-space [11,15]) 
serve as useful reference points since they are covariant with isotropic scaling, 
invariant to rotation and translation, and robust to small affine shears. 
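
The scale selection of step II can be illustrated on a dark circle against a white background. The following sketch, which assumes a binary sampling window and integer scales, evaluates the entropy at the circle centre over a range of radii and keeps the strict local maxima; `entropy_over_scale` and `peak_scales` are hypothetical helper names.

```python
import numpy as np

def entropy_over_scale(image, cx, cy, scales, bins=256):
    """Entropy (bits) of the circular-window histogram at each scale."""
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    out = []
    for s in scales:
        counts, _ = np.histogram(image[d2 <= s * s], bins=bins, range=(0, 256))
        p = counts[counts > 0] / counts.sum()
        out.append(float(-(p * np.log2(p)).sum()))
    return np.array(out)

def peak_scales(scales, entropy):
    """Scales at which the entropy is a strict local maximum."""
    return [scales[i] for i in range(1, len(scales) - 1)
            if entropy[i] > entropy[i - 1] and entropy[i] > entropy[i + 1]]

# Dark disc of radius 10 pixels on a white background, probed at its centre.
ys, xs = np.mgrid[0:64, 0:64]
disc = np.where((xs - 32) ** 2 + (ys - 32) ** 2 <= 100, 0, 255).astype(np.uint8)
scales = list(range(5, 26))
H = entropy_over_scale(disc, 32, 32, scales)
peaks = peak_scales(scales, H)
```

For this disc the single peak falls at s = 14, close to √2 × 10, the radius at which the window contains equal proportions of black and white pixels.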

Note however, that the peaks for both points in Figure 3(b) attain an almost 
identical magnitude. This is to be expected since both patches contain almost 
identical proportions of black and white pixels. In fact, since histogramming 
destroys all local ordering information, permuting the pixels of the local patch does
not affect its entropy. Figure 2(b) shows the entropy over scale function for an
image patch taken from 2(a) and three permutations of its pixels: a linear ramp, 
a random reordering and a radial gradient. The entropy at the maximum scale 
(that of the whole patch) is the same for all permutations. However, the shape 
of the entropy function is quite different for each case. 

The role of Step III, the inter-scale unpredictability measure $W_D$, is to weight
the entropy value such that some permutations are preferred over others. It is
defined as the magnitude change of the PDF as a function of scale; therefore,
those orderings that are statistically self-dissimilar over scale are ranked higher
than those that exhibit stationarity.

Figure 3(c) shows $W_D$ as a function of scale. It can be seen that the plot
corresponding to the edge point has a much lower value than the one for the 
centre point at the selected scale value. In essence, it is a normalised measure of 
scale localisation. For example, in a noise image the pixel values are highly un- 
predictable at any one scale but over scale the statistics are stationary. However, 
a noise patch against a plain background would be salient due to the change in 
statistics. 
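
The distinction between a stationary noise field and a noise patch on a plain background can be checked numerically. The sketch below uses a discrete stand-in for the inter-scale measure, namely the sum of absolute differences between the histograms at two consecutive integer scales; this finite-difference form, the omission of any scale normalisation, and the names `scale_pdf` and `inter_scale_change` are assumptions for illustration.

```python
import numpy as np

def scale_pdf(image, cx, cy, s, bins):
    """Normalised histogram of the pixels within radius s of (cx, cy)."""
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= s * s
    counts, _ = np.histogram(image[inside], bins=bins, range=(0, bins))
    return counts / counts.sum()

def inter_scale_change(image, cx, cy, s, bins=2):
    """Sum of absolute PDF differences between scales s and s - 1."""
    return float(np.abs(scale_pdf(image, cx, cy, s, bins)
                        - scale_pdf(image, cx, cy, s - 1, bins)).sum())

rng = np.random.default_rng(0)
noise = rng.integers(0, 2, size=(64, 64))            # binary noise everywhere
ys, xs = np.mgrid[0:64, 0:64]
# The same noise restricted to a disc of radius 10, on a plain (zero) background.
patch = np.where((xs - 32) ** 2 + (ys - 32) ** 2 <= 100, noise, 0)

w_noise = inter_scale_change(noise, 32, 32, 11)      # statistics stay stationary
w_patch = inter_scale_change(patch, 32, 32, 11)      # window crosses the patch boundary
```

With these toy images the change measured across the patch boundary is several times larger than the change measured in the stationary noise field at the same scale.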

In the continuous case, the saliency measure $Y_D$, a function of scale s and
position x, is defined as:

$$Y_D(s_p, x) = H_D(s_p, x)\, W_D(s_p, x) \qquad (1)$$

i.e. for each point x the set of scales $s_p$, at which entropy peaks, is obtained,
then the saliency is determined by weighting the entropy at these scales by $W_D$.
Entropy, $H_D$, is given by:

$$H_D(s, x) = -\int p(I, s, x) \log_2 p(I, s, x)\, dI \qquad (2)$$

where p(I, s, x) is the probability density of the intensity I as a function of scale
s and position x. The set of scales $s_p$ is defined by:
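
$$s_p = \left\{ s : \frac{\partial H_D(s, x)}{\partial s} = 0,\ \frac{\partial^2 H_D(s, x)}{\partial s^2} < 0 \right\} \qquad (3)$$

The inter-scale saliency measure, $W_D(s, x)$, is defined by:

$$W_D(s, x) = \int \left| \frac{\partial}{\partial s}\, p(I, s, x) \right| dI \qquad (4)$$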



In this paper, entropy is measured for the grey level image intensity but other 
attributes, e.g. colour or orientation, may be used instead; see [8] for examples. 

This approach has a number of attractive properties. It offers a more general 
model of feature saliency and scale compared to conventional feature detection 
techniques. Saliency is defined in terms of spatial unpredictability; scale by the 
sampling window and its parameterisation. For example, a blob detector im- 
plemented using a convolution of multiple scale Laplacian-of-Gaussian (LoG) 
functions [14], whilst responding to a number of different feature shapes, responds
maximally only to the LoG function itself (or its inverse); in other words, it acts
as a matched filter 1 . Many convolution based approaches to feature detection 
exhibit the same bias, i.e. a preference towards certain features. This specificity 
has a detrimental effect on the quality of the features and scales selected. In 

1 This property is somewhat alleviated by the tendency of blurring to smooth image 
structures into LoG like functions. 
contrast, the saliency approach responds equally to the LoG and all other permutations
of its pixels provided that the constraint on $W_D(s)$ is satisfied. This
property enables the method to perform well over intra-class variations as is 
demonstrated in Section 4. 



2.2 Affine Invariant Saliency 

In the original formulation of [10], the method was invariant to the similarity 
group of geometric transformations and to photometric shifts. In this section, 
we develop the method to be fully affine invariant to geometric transformations. 
In principle, the modification is quite straightforward and may be achieved by 
replacing the circular sampling window by an ellipse: under an affine transfor- 
mation, circles map onto ellipses. The scale parameter s is replaced by a vector 
$\mathbf{s} = (s, \rho, \theta)$, where $\rho$ is the axis ratio and $\theta$ the orientation of the ellipse. Under
such a scheme, the major and minor axes of the ellipse are given by $s/\sqrt{\rho}$ and
$s\sqrt{\rho}$ respectively.
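
A small sketch of this parameterisation follows. It assumes that the quoted axis lengths are semi-axes measured from the window centre and that θ rotates the window coordinates as x' = x cos θ + y sin θ, y' = y cos θ − x sin θ; `ellipse_axes` and `ellipse_mask` are illustrative names. One consequence worth noting is that the product of the two axes is s² for every ρ, so changing the ratio alone leaves the window area unchanged.

```python
import numpy as np

def ellipse_axes(s, rho):
    """Semi-axes s / sqrt(rho) and s * sqrt(rho) of the sampling ellipse."""
    return s / np.sqrt(rho), s * np.sqrt(rho)

def ellipse_mask(shape, cx, cy, s, rho, theta):
    """Boolean mask of the pixels inside the ellipse with parameters (s, rho, theta)."""
    a, b = ellipse_axes(s, rho)
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    x, y = xs - cx, ys - cy
    xr = x * np.cos(theta) + y * np.sin(theta)   # coordinates in the rotated frame
    yr = y * np.cos(theta) - x * np.sin(theta)
    return (xr / a) ** 2 + (yr / b) ** 2 <= 1.0

a, b = ellipse_axes(10.0, 0.25)                  # elongated window
circle = ellipse_mask((64, 64), 32, 32, 10.0, 1.0, 0.0)
ellipse = ellipse_mask((64, 64), 32, 32, 10.0, 0.25, 0.0)
```

Both masks cover roughly π s² pixels, so the entropy estimates at a given s are computed from comparable sample sizes regardless of the axis ratio.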

Increasing the dimensionality of the sampling window creates the possibility 
of degenerate cases. For example, in the case of a dark circle against a white 
background (see Figure 3(a)) any elliptical sampling window that contains an 
equal number of black and white pixels ($H_D$ constraint) but does not exclude
any black pixels at the previous scale ($W_D$ constraint) will be considered equally
salient. Such cases are avoided by requiring that the inter-scale saliency, $W_D$,
is smooth across a number of scales. A simple way to achieve this is to apply a
3-tap averaging filter to $W_D$ over scale.
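
The smoothing step can be written in a few lines. The sketch below assumes equal weights of 1/3 and zero padding at the two ends of the scale range, neither of which is specified above; `smooth_over_scale` is an illustrative name.

```python
import numpy as np

def smooth_over_scale(w):
    """3-tap moving average of the inter-scale saliency along the scale axis."""
    return np.convolve(np.asarray(w, dtype=float), np.ones(3) / 3.0, mode="same")

# An isolated spike at one scale is spread over its two neighbours.
smoothed = smooth_over_scale([0.0, 0.0, 3.0, 0.0, 0.0])
```

A response that is high at a single scale only is thereby attenuated relative to one that persists over adjacent scales.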



2.3 Local Search 

The complexity of a full search can be significantly reduced by adopting a local 
strategy in the spirit of [2,19,20]. Our approach is to start the search only at seed
points (positions and scales) found by applying the original similarity invariant
search. Each seed circle is then locally adapted in order to maximise two criteria,
$H_D$ (entropy) and $W_D$ (inter-scale saliency). $W_D$ is maximised when the ratio
and orientation match that of the local image patch [9] at the correct scale,
defined by a peak in $H_D$. Therefore, we adopt an iterative refinement approach.
The ratio and orientation are adjusted in order to maximise $W_D$, then the scale
is adjusted such that $H_D$ is peaked. The search is stopped when neither the scale
nor shape change (or a maximum iteration count is exceeded).

The final set of regions are chosen using a greedy clustering algorithm which 
operates from the most salient feature down (highest value of $Y_D$) and clusters
together all features within the support region of the current feature. A global 
threshold on value or number is used. 
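
One possible reading of this greedy step is sketched below. It assumes circular support regions, treats clustering as discarding every less salient feature whose centre falls inside the support of an already selected one, and applies the global threshold as a cap on the number of regions; the tuple layout `(x, y, scale, saliency)` and the name `greedy_select` are hypothetical.

```python
def greedy_select(features, max_regions):
    """Keep the most salient features, skipping those inside a kept support region.

    features: iterable of (x, y, scale, saliency) tuples.
    """
    selected = []
    for x, y, s, sal in sorted(features, key=lambda f: f[3], reverse=True):
        if len(selected) == max_regions:
            break
        covered = any((x - sx) ** 2 + (y - sy) ** 2 <= ss ** 2
                      for sx, sy, ss, _ in selected)
        if not covered:
            selected.append((x, y, s, sal))
    return selected

candidates = [(10, 10, 5, 0.90), (12, 11, 4, 0.80),
              (40, 40, 6, 0.70), (41, 40, 3, 0.95)]
kept = greedy_select(candidates, max_regions=10)
```

In the toy input the two weaker candidates lie inside the support of stronger neighbours, so only two regions survive.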

The performance of this local method is compared to exhaustive search in 
Section 3. 



Fig. 4. $W_D$ as a function of x and scale for image shown in (a) at the y-position
indicated by the dashed line using standard sampling (b), and anti-aliased sampling 
(c). 



2.4 Anti-aliased Sampling 

The simplest method for estimating local PDFs from images is to use histogram- 
ming over a local neighbourhood, for example a circular region; pixels inside the 
region are counted whilst those outside are not. However, this binary approach 
gives rise to step changes in the histogram as the scale is increased. $W_D$ is especially
sensitive to this since it measures the difference between two concentric
sampling windows. For example, Figure 4(b) shows the variation of $W_D$ as a
function of x and scale for the image shown in 4(a). The surface is taken at a 
point indicated by the dashed line. Somewhat surprisingly, the surface is highly
irregular and noisy even for this ideal noise-free image; consequently, so is the
saliency space. Intuitively, the solution to this problem lies with a smoother
transition between the pixels that are included in the histogram and the ones 
that are not. 

The underlying problem is, in fact, an instance of aliasing. Restated from a 
sampling perspective, the binary representation of the window is sampling with- 
out pre-filtering. Evidently, this results in severe aliasing. This problem has long 
been recognised in the Computer Graphics community and numerous methods 
have been devised to better represent primitives on a discrete display [5]. 

To overcome this problem we use a smooth sampling window (i.e. a filtered 
version of the ideal sampling window). However, in contrast to the CG applica- 
tion, here, the window weights the contributions of the pixels to the histogram 
not the pixel values themselves; pixels near the edge contribute less to the count 
than ones near the centre. It does not blur the image. 

Griffin [6] and Koenderink and van Doorn [12] have suggested weighting histogram
counts using a Gaussian window, but not in relation to anti-aliasing.
However, for our purposes, the Gaussian poorly represents the statistics of the 
underlying pixels towards the edges due to the slow drop-off. Its long tails slow
the computation, since more pixels have to be considered, and also result
in poor localisation. The traditional ‘pro-Gaussian’ arguments do not seem
appropriate here. 
Analytic solutions for the optimal sampling window are, in theory at least,
possible to obtain. However, empirically we have found the following function 
works well: 

$$SW(z) = \frac{1}{1 + (z/s)^n}, \qquad z = \sqrt{\left(x'\sqrt{\rho}\right)^2 + \left(y'/\sqrt{\rho}\right)^2} \qquad (5)$$

with n = 42, where $x' = x\cos\theta + y\sin\theta$ and $y' = y\cos\theta - x\sin\theta$ achieve
the desired rotation. We truncate for small values of SW(z). This sampling
window gives scalar values as a function of distance, z, from the window centre,
which are used to build the histogram. Figure 4(c) shows the same slice through
$W_D$ space but generated using Equation 5 for the sampling weights. Further
implementation details and analysis may be found in [9,10]. 
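
To make the effect of a smooth window concrete, the sketch below builds a weighted histogram in which each pixel contributes its window weight rather than a unit count. It assumes a window of the form SW(z) = 1/(1 + (z/s)^n) with n = 42 and an isotropic distance z, and it truncates weights below an arbitrary 10^-3; these choices, and the names `window_weight` and `weighted_pdf`, are assumptions rather than a record of the implementation in [9,10].

```python
import numpy as np

def window_weight(z, s, n=42):
    """Smooth window: close to 1 inside radius s, 0.5 at z = s, close to 0 outside."""
    return 1.0 / (1.0 + (np.asarray(z, dtype=float) / s) ** n)

def weighted_pdf(image, cx, cy, s, bins=256, cutoff=1e-3):
    """Histogram of an integer-valued image with per-pixel window weights."""
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    z = np.hypot(xs - cx, ys - cy)
    w = window_weight(z, s)
    w[w < cutoff] = 0.0                       # truncate negligible weights
    hist = np.bincount(image.ravel(), weights=w.ravel(), minlength=bins)
    return hist / hist.sum()

# Dark disc of radius 10 on a white background, sampled at scale 14.
ys, xs = np.mgrid[0:64, 0:64]
disc = np.where((xs - 32) ** 2 + (ys - 32) ** 2 <= 100, 0, 255).astype(np.uint8)
pdf = weighted_pdf(disc, 32, 32, 14.0)
```

Because the weights fall off continuously across the window boundary, the histogram changes gradually as s increases instead of jumping whenever a ring of pixels enters the window.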



3 Performance under Viewpoint Variations 



The objective here is to determine the extent to which detected regions commute 
with viewpoint. This is an example of the global transformation requirement 
discussed in the introduction. 

For these experiments, we follow the testing methodology proposed in [18, 
19]. The method is applied to an image set 2 comprising different viewpoints of 
the same (largely planar) scene for which the inter-image homography is known. 
Repeatability is determined by measuring the area of overlap of corresponding 
features. Two features are deemed to correspond if their projected positions differ 
by less than 1.5 pixels. Results are presented in terms of error in overlapping
area between two ellipses $\mu_a$, $\mu_b$:

$$\epsilon_S = 1 - \frac{\mu_a \cap (A^T \mu_b A)}{\mu_a \cup (A^T \mu_b A)} \qquad (6)$$

where A defines a locally linearized affine transformation of the homography
between the two images and $\mu_a \cap (A^T \mu_b A)$ and $\mu_a \cup (A^T \mu_b A)$ represent the area
of intersection and union of the ellipses respectively.
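
For intuition, the overlap error can be approximated by rasterising both regions on a common grid. The sketch below assumes that each ellipse is given by a centre c and a symmetric positive definite matrix μ, with the region defined by (x − c)^T μ (x − c) ≤ 1, and that the second ellipse has already been mapped into the first image, so the matrix A does not appear; `ellipse_region`, `overlap_error`, the grid extent and the step are illustrative.

```python
import numpy as np

def ellipse_region(mu, centre, xs, ys):
    """Boolean mask of grid points satisfying (x - c)^T mu (x - c) <= 1."""
    dx, dy = xs - centre[0], ys - centre[1]
    return (mu[0, 0] * dx * dx + 2.0 * mu[0, 1] * dx * dy
            + mu[1, 1] * dy * dy) <= 1.0

def overlap_error(mu_a, c_a, mu_b, c_b, extent=50.0, step=0.25):
    """1 - intersection area / union area, estimated on a regular grid."""
    grid = np.arange(-extent, extent + step, step)
    xs, ys = np.meshgrid(grid, grid)
    a = ellipse_region(mu_a, c_a, xs, ys)
    b = ellipse_region(mu_b, c_b, xs, ys)
    return 1.0 - np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

big = np.eye(2) / 100.0      # circle of radius 10
small = np.eye(2) / 50.0     # concentric circle with half the area
e_same = overlap_error(big, (0.0, 0.0), big, (0.0, 0.0))
e_half = overlap_error(big, (0.0, 0.0), small, (0.0, 0.0))
e_apart = overlap_error(big, (-20.0, 0.0), big, (20.0, 0.0))
```

Identical regions give an error of 0, disjoint regions give 1, and a concentric region of half the area gives approximately 0.5.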

Figure 5(a) shows the repeatability performance as a function of viewpoint 
of three variants of the affine invariant salient region detector: exhaustive search 
without anti-aliasing (FS Affine ScaleSal), exhaustive search with anti-aliasing 
(AA FS Affine ScaleSal), and local search with anti-aliasing (AA LS Affine ScaleSal).
The performance is compared to the detector of Mikolajczyk and Schmid
[19], denoted Affine MSHar. Results are shown for $\epsilon_S$ < 0.4.

It can be seen that the full search Affine Saliency and Affine MSHar features 
have a similar performance over the range of viewpoints. However, from 40° the 
anti-aliased sampling provides some gains, though curiously diminishes perfor- 
mance at 20°. The local search anti-aliased Affine Saliency performs reasonably 
well compared to the full search methods but of course takes a fraction of the 
time to compute. 

2 Graffiti6 from http://www.inrialpes.fr/lear/people/Mikolajczyk/ 






T. Kadir, A. Zisserman, and M. Brady 









Fig. 5. Repeatability results under (a) viewpoint changes, (b,d) background perturba- 
tions and intra-class variations for Bike images, (c) intra-class variations for car and 
face images. Plots (b,c) are for similarity invariant and (d) for affine invariant detectors. 



4 Performance under Intra-class Variation and Image 
Perturbations 

The aim here is to measure the performance of a region detector under intra-class 
variations and image perturbations - the other two requirements specified in the 
introduction. In the following subsections we develop this measure and then 
compare performance of the salient region detector to other region operators. 
In these experiments we used similarity invariant versions of three detectors: 
similarity Saliency (ScaleSal), Difference-Of-Gaussian (DoG) blob detector [16] 
and the multi-scale Harris (MSHar) with Laplacian scale selection — this is 
Affine MSHar without the affine adaptation [19]. We also used affine invariant 
detectors Affine ScaleSal and Affine MSHar. An affine invariant version of the 
DoG detector was not available. 
4.1 The Performance Measure

We will discuss first measuring repeatability over intra-class variation. Suppose 
we have a set of images of the same object class, e.g. motorbikes. A region 
detection operator which is unaffected by intra-class variation will reliably select 
regions on corresponding parts of all the objects, say the wheels, engine or seat 
for motorbikes. Thus, we assess performance by measuring the (average) number 
of correct correspondences over the set of images. 

The question is: what constitutes a correct corresponding region? To deter- 
mine this, we use a proxy to the true intra-class transformation by assuming 
that an affinity approximately maps one imaged object instance to another. The 
affinities are estimated here by manually clicking on corresponding points in each 
image, e.g. for motorbikes the wheels and seat/petrol tank join. We consider a 
region to match if it fulfils three requirements: its position matches within 10 
pixels; its scale is within 20% and normalised mutual information 3 between the 
appearances is > 0.2. For the affine invariant detectors, the scale test is replaced 
with the overlap error, $\epsilon_S$ < 0.4 (Eq. 6), and the mutual information is applied
to elliptical patches transformed to circles. These are quite generous thresholds 
since the objects are different and the geometric mapping approximate. 
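
The three tests can be combined as in the sketch below. It assumes patches already resampled to a common size, a 16-bin joint histogram for the entropies, and a relative scale difference measured against the larger scale; none of these details is fixed by the description above, and `normalised_mi` and `regions_match` are hypothetical names. The mutual information is normalised as MI(A, B) = 2 (H(A) + H(B) − H(A, B)) / (H(A) + H(B)).

```python
import numpy as np

def _entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def normalised_mi(a, b, bins=16):
    """Normalised mutual information of two equally sized 8-bit patches.

    Undefined (division by zero) if both patches are constant.
    """
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    h_a = _entropy(joint.sum(axis=1))
    h_b = _entropy(joint.sum(axis=0))
    h_ab = _entropy(joint.ravel())
    return 2.0 * (h_a + h_b - h_ab) / (h_a + h_b)

def regions_match(pos_a, pos_b, scale_a, scale_b, patch_a, patch_b):
    """Position within 10 pixels, scale within 20%, normalised MI above 0.2."""
    close = np.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1]) <= 10.0
    similar_scale = abs(scale_a - scale_b) <= 0.2 * max(scale_a, scale_b)
    return bool(close and similar_scale
                and normalised_mi(patch_a, patch_b) > 0.2)

# Toy patches: A varies along rows, B along columns, so they share no information.
i, j = np.indices((8, 8))
A = 64 * (i % 4)
B = 64 * (j % 4)
```

Identical patches give a normalised mutual information of 1 and statistically independent patches give 0, so the 0.2 threshold only rejects pairs whose appearances are nearly unrelated.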

In detail we measure the average correspondence score S as follows. N regions
are detected on each image of the M images in the dataset. Then for a particular
reference image i the correspondence score $S_i$ is given by the ratio of
corresponding regions to detected regions over all the other images in the dataset, i.e.:

$$S_i = \frac{\text{Total number of matches}}{\text{Total number of detected regions}} = \frac{N_M^i}{N(M-1)}$$

The score $S_i$ is computed for M/2 different selections of the reference image,
and averaged to give S. The score is evaluated as a function of the number of
detected regions N. For the DoG and MSHar detectors the features are ordered 
on Laplacian (or DoG) magnitude strength, and the top N regions selected. 
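
As a small numerical illustration of the score, the sketch below evaluates the per-reference score from match counts and averages it over the chosen reference images. The match counts are invented for the example, and `correspondence_score` and `average_score` are hypothetical helpers.

```python
def correspondence_score(n_matches, n_regions, n_images):
    """Matches found for one reference image divided by N (M - 1)."""
    return n_matches / (n_regions * (n_images - 1))

def average_score(match_counts, n_regions, n_images):
    """Mean of the per-reference scores over the selected reference images."""
    scores = [correspondence_score(m, n_regions, n_images) for m in match_counts]
    return sum(scores) / len(scores)

# Hypothetical numbers: N = 30 regions per image, M = 5 images, and two
# reference images with 30 and 36 matches against the other four images.
S = average_score([30, 36], n_regions=30, n_images=5)
```

Here the two reference images score 30/120 = 0.25 and 36/120 = 0.30, giving an average of 0.275.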

In order to test insensitivity to image perturbation the data set is split into 
two parts: the first contains images with a uniform background and the second, 
images with varying degrees of background clutter. If the detector is robust to 
background clutter then the average correspondence score S should be similar 
for both subsets of images. 



4.2 Intra-class Variation Results 

The experiments are performed on three separate data-sets, each containing 
different instances from an object class: 200 images from Caltech Motorbikes 
(Side), 200 images from Caltech Human face (Front), and all 126 Caltech Cars 
(Rear) images. Figure 6 shows examples from each data set 4 . 

3 $MI(A, B) = 2\,(H(A) + H(B) - H(A, B)) / (H(A) + H(B))$

4 Available from http://www.robots.ox.ac.uk/~vgg/data/.
Fig. 6. Example images from (a) the two parts of the Caltech motorbike data set without
background clutter (top) and with background clutter (bottom), and (b) Caltech
cars (top) and Caltech faces (bottom). 



The average correspondence score S results for the similarity invariant de- 
tectors are shown in Figure 5(b) and (c). Figure 5(d) shows the results for the 
affine detectors on the motorbikes. For all three data sets and at all thresholds 
the best results are consistently obtained using the saliency detector. However, 
the repeatability for all the detectors is lower for the face and cars compared to 
the motorbike case. This could be due to the appearances of the different object 
classes; motorbikes tend to appear more complex than cars and faces. 

Figure 7 shows smoothed maps of the locations at which features were de- 
tected in all 200 images in the motorbike image set. All locations have been back 
projected onto a reference image. Bright regions are those at which detections 
are more frequent. The map for the saliency detector indicates that most 
detections are near the object, with a few high detection points near the engine, 
seat, wheel centres and headlamp. In contrast, the DoG and MSHar maps show a much 
more diffuse pattern over the entire area caused by poor localisation and false 
responses to background clutter. 



4.3 Image Perturbation Results 

The motorbike data set is used to assess insensitivity to background clutter. 
There are 84 images with a uniform background, and 116 images with varying 
degrees of background clutter; see Figure 6(a). 

Figure 5(b) shows separate plots for motorbike images with and without 
background clutter at N=10 to 40. The saliency detector finds, on average, ap- 
proximately 25% of 30 features within the matching constraints; this corresponds 
to about 7 features per image on average. In contrast, the MSHar and DoG de- 
tectors select 2-3 object features per image at this threshold. Typical examples 
of the matched regions selected by the saliency detector on this data set are 
shown in Figure 8. 

Panels: MultiScale Harris, Difference-of-Gaussian, Saliency. 

Fig. 7. Smoothed map of the detected features over all 200 images in the motorbike 
set back projected onto one image. The colour indicates the normalised number of 
detections in a given area (white is highest). Note the relative ‘tightness’ of the bright 
areas of the saliency detector compared to the DoG and MSHar. 













Fig. 8. Examples of the matched regions selected by the similarity saliency detector 
from the motorbike images: whole front wheels; front mud-guard/ wheel corner; seat; 
headlamp. 



There is also a marked difference in the way the various detectors are affected 
by clutter. It has little effect on the ScaleSal detector whereas it significantly 
reduces the DoG performance and similarly that of MSHar. Similar trends are 
obtained for the affine invariant detectors applied to the motorbikes images, 
shown in Figure 5(d). 

Local perturbations due to changes in the scene configuration, background 
clutter or changes within the object itself can be mitigated by ensuring 
compact support of any probing elements. Both the DoG and MSHar methods rely 
on relatively large support windows which cause them to be affected by non-local 
changes in the object and background; compare the cluttered and uncluttered 
background results for the motorbike experiments. 

There may be several other relevant factors. First, both the DoG and MSHar 
methods blur the image, hence causing a greater degree of similarity between 
objects and background. Second, in most images the objects of interest tend to 
be in focus while backgrounds are out of focus and hence blurred. Blurred regions 
tend to exhibit slowly varying statistics which result in a relatively low entropy 
and inter-scale saliency in the saliency detector. Third, the DoG and MSHar 
methods define saliency with respect to specific properties of the local surface 
geometry. In contrast, the saliency detector uses a much broader definition. 

5 Discussion and Future Work 

In this paper we have presented a new region detector which is comparable to 
the state of the art [19,20] in terms of co-variance with viewpoint. We have also 
demonstrated that it has superior performance on two further criteria: robustness 
to image perturbations, and repeatability under intra-class variability. The new 
detector extends the original method of Kadir and Brady to affine invariance; we 
have developed a properly anti-aliased implementation and a fast optimisation 
based on a local search. 

We have also proposed a new methodology to test detectors under intra-class 
variations and background perturbations. Performance under this extended cri- 
terion is important for many applications, for example part detectors for object 
recognition. 

The intra-class experiments demonstrate that defining saliency in the manner 
of the saliency detector is, on average, a better search heuristic than the other 
region detectors tested on at least the three data sets used here. 

It is interesting to consider how the design of feature detectors affects perfor- 
mance. Many global effects, such as viewpoint, scale or illumination variations 
can be modelled mathematically and as such can be tackled directly provided 
the detector also lends itself to such analysis. Compared to the diffusion-based 
scale-spaces, relatively little is currently known about the properties of spaces 
generated by statistical methods such as that described here. Further investiga- 
tion of its properties seems an appealing line of future work. 

We plan to compare the saliency detector to other region detection 
approaches which are not based on filter response extrema, such as [17,22]. 

Abstract. We extend the constellation model to include heterogeneous parts 
which may represent either the appearance or the geometry of a region of the 
object. The parts and their spatial configuration are learnt simultaneously and 
automatically, without supervision, from cluttered images. 

We describe how this model can be employed for ranking the output of an image 
search engine when searching for object categories. It is shown that visual con- 
sistencies in the output images can be identified, and then used to rank the images 
according to their closeness to the visual object category. 

Although the proportion of good images may be small, the algorithm is designed 
to be robust and is capable of learning in either a totally unsupervised manner, or 
with a very limited amount of supervision. 

We demonstrate the method on image sets returned by Google’s image search for 
a number of object categories including bottles, camels, cars, horses, tigers and 
zebras. 



1 Introduction 

Just type a few keywords into the Google image search engine, and hundreds, sometimes 
thousands of pictures are suddenly available at your fingertips. As any Google user is 
aware, not all the images returned are related to the search. Rather, typically more than 
half look completely unrelated; moreover, the useful instances are not returned first - 
they are evenly mixed with unrelated images. This phenomenon is not difficult to explain: 
current Internet image search technology is based upon words, rather than image content 
- the filename of the image and text near the image on a web-page [4]. These criteria 
are effective at quickly gathering related images from the millions on the web, but the 
final outcome is far from perfect. 

We conjecture that, even without improving the search engine per se, one might 
improve the situation by measuring ‘visual consistency’ amongst the images that are 
returned and re-ranking them on the basis of this consistency, so increasing the fraction 
of good images presented to the user within the first few web pages. This conjecture 
stems from the observation that the images that are related to the search typically are 
visually similar, while images that are unrelated to the search will typically look different 
from each other as well. 

How might one measure ‘visual consistency’ ? One approach is to regard this problem 
as one of probabilistic modeling and robust statistics. One might try and fit the data (the 
mix of images returned by Google) with a parametrized model which can accommodate 
the within-class variation in the requested category, for example the various shapes and 
labels of bottles, while rejecting the outliers (the irrelevant images). Learning a model 
of the category under these circumstances is an extremely challenging task. First, 
even objects within the same category look quite different from each other. Second, 
there are the usual difficulties in learning from images such as lighting and viewpoint 
variations (scale, foreshortening) and partial occlusion. Third, and most importantly, 
in the image search scenario the object is actually only present in a sub-set of the images, 
and this sub-set (and even its size) is unknown. 

While methods exist to model object categories [9,13,15], it is essential that the ap- 
proach can learn from a contaminated training set with a minimal amount of supervision. 
We therefore use the method of Fergus et al. [10], extending it to allow the parts to be 
heterogeneous, representing a region’s appearance or geometry as appropriate. The model 
and its extensions are described in section 2. The model was first introduced by Burl et 
al. [5]. Weber et al. [23] then developed an EM-based algorithm for training the model 
on cluttered datasets with minimal supervision. In [10] a probabilistic representation for 
part appearance was developed, the model was made scale invariant, and both appearance 
and shape were learnt simultaneously. 

Other approaches to this problem [7,19] use properties of colour or texture his- 
tograms. While histogram approaches have been successful in Content Based Image 
Retrieval [2,12,21], they are unsuitable for our task since the within-class returns vary 
widely in colour and texture. 

We explore two scenarios: in the first the user is willing to spend a limited amount of 
time (e.g. 20-30 seconds) picking a handful of images of which they want more examples 
(a simple form of relevance feedback [20]); in the second the user is impatient and there 
is no human intervention in the learning (i.e. it is completely unsupervised). 

Since the model only uses visual information, a homonymous category (one that 
has multiple meanings, for example “chips” would return images of both “French fries” 
and “microchips”) poses problems due to multiple visual appearances. Consequently we 
will only consider categories with one dominant meaning in this paper. The algorithm 
only requires images as its input, so can be used in conjunction with any existing search 
engine. In this paper we have chosen to use Google’s image search. 



2 The Model 

In this section we give an overview of our previously developed method [10], together 
with the extension to heterogeneous parts. 

An object model consists of a number of parts which are spatially arranged over the 
object. A part here may be a patch of pixels or a curve segment. In either case, a part 
is represented by its intrinsic description (appearance or geometry), its scale relative to 
the model, and its occlusion probability. The overall model shape is represented by the 
mutual position of the parts. The entire model is generative and probabilistic, so part 
description, scale, model shape and occlusion are all modeled by probability density 
functions, which are Gaussians. 

The process of learning an object category is one of first detecting features with 
characteristic scales, and then estimating the parameters of the above densities from these 
features, such that the model gives a maximum-likelihood description of the training 
data. Recognition is performed on a query image by again first detecting features (and 
their scales), and then evaluating the features in a Bayesian manner, using the model 
parameters estimated in the learning. 

2.1 Model Structure Overview 

A model consists of P parts and is specified by parameters $\theta$. Given N detected features 
with locations X, scales S, and descriptions D, the likelihood that an image contains an 
object is assumed to have the following form: 

$$ p(X,S,D|\theta) = \sum_{h} \underbrace{p(D|X,S,h,\theta)}_{\text{Part Description}} \; \underbrace{p(X|S,h,\theta)}_{\text{Shape}} \; \underbrace{p(S|h,\theta)}_{\text{Rel. Scale}} \; \underbrace{p(h|\theta)}_{\text{Other}} $$

where the summation is over allocations, h, of parts to features. Typically a model has 
5-7 parts and there will be around thirty features of each type in an image. 
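
To give a feel for the size of the summation over h, the sketch below counts allocations under a simplifying assumption that is not stated in the text: every part is assigned to a distinct feature and no part is occluded. The function name is illustrative only.

```python
import math

def num_allocations(n_features, n_parts):
    """Number of ways to assign n_parts parts to distinct features.

    Assumes every part is matched to exactly one feature and no two parts
    share a feature; occluded (unassigned) parts are not counted here.
    """
    return math.perm(n_features, n_parts)

# With the typical figures quoted above (5 parts, about thirty features):
print(num_allocations(30, 5))  # 17100720
```

Even under this restricted assumption a 5-part model already has on the order of 10^7 allocations per image, so the number of terms grows quickly with the number of parts.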

Similarly it is assumed that non-object background images can be modeled by a 
likelihood of the same form with parameters $\theta_{bg}$. The decision as to whether a particular 
image contains an object or not is determined by the likelihood ratio: 

$$ R = \frac{p(X,S,D|\theta)}{p(X,S,D|\theta_{bg})} \qquad (1) $$

The model, at both the fitting and recognition stages, is scale invariant. Full details of 
the model and its fitting to training data using the EM algorithm are given in [10], and 
essentially the same representations and estimation methods are used. 

2.2 Heterogeneous Parts 

Existing approaches to recognition learn a model based on a single type of feature (e.g. 
image patches [3,16], texture regions [18] or Haar wavelets [22]). However, the different 
visual nature of objects means that this is limiting. For some objects, like wine bottles, 
the essence of the object is captured far better with geometric information (the outline) 
rather than by patches of pixels. Of course, the reverse is true for many objects, like 
human faces. Consequently, a flexible visual recognition system must have multiple 
feature types. The flexible nature of the constellation model makes this possible. As 
the description densities of each part are independent, each can use a different type of 
feature. 

In this paper, only two types of features are included, although more can easily be 
added. The first consists of regions of pixels, this being the feature type used previously; 
the second consists of curve segments. Figure 1 illustrates these features on two typical 
images. These features are complementary: one represents the appearance of object 
patches, the other represents the object geometry. 

Fig. 1. (a) Sample output from the region detector. The circles indicate the scale of the region. 
(b) A long curve segment being decomposed at its bitangent points. (c) Curves within the 
similarity-invariant space - note the clustering. (d), (e) & (f) show the curve segments identified 
in three images. The green and red markers indicate the start and end of the curve respectively. 



2.3 Feature Detection 

Pixel patches. Kadir and Brady’s interest operator [14] finds regions that are salient 
over both location and scale. It is based on measurements of the grey level histogram 
and entropy over the region. The operator detects a set of circular regions so that both 
position (the circle centre) and scale (the circle radius) are determined, along with a 
saliency score. The operator is largely invariant to scale changes and rotation of the 
image. For example, if the image is doubled in size then a corresponding set of regions 
will be detected (at twice the scale). Figure 1(a) shows the output of the operator on a 
sample image. 
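
The entropy measurement at the core of the operator can be illustrated as follows. This is only a sketch of the histogram-entropy step for one circular region; it is not Kadir and Brady's full detector, which also searches over scale and location. The function name, the 16-bin histogram and the 0-255 grey-level range are assumptions.

```python
import numpy as np

def region_entropy(image, cx, cy, radius, bins=16):
    """Shannon entropy (in bits) of the grey-level histogram inside a circle.

    image is a 2-D array of grey levels assumed to lie in [0, 255];
    (cx, cy) is the circle centre in pixel coordinates.
    """
    img = np.asarray(image, dtype=float)
    rows, cols = np.mgrid[:img.shape[0], :img.shape[1]]
    inside = (cols - cx) ** 2 + (rows - cy) ** 2 <= radius ** 2
    hist, _ = np.histogram(img[inside], bins=bins, range=(0, 256))
    # Empty bins contribute nothing to the entropy, so drop them.
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())
```

A flat region gives zero entropy, while a region split evenly between two grey levels gives one bit; evaluating the function over a range of radii is one way to look for a characteristic scale.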

Curve segments. Rather than only consider very local spatial arrangements of edge 
points (as in [1]), extended edge chains are used, detected by the Canny edge operator [6]. 
The chains are then split into segments between bitangent points, i.e. points at 
which a line has two points of tangency with the curve. Figure 1(b) shows an example. 

This decomposition is used for two reasons: first, bitangency is covariant with projec- 
tive transformations. This means that for near planar curves the segmentation is invariant 
to viewpoint, an important requirement if the same, or similar, objects are imaged at dif- 
ferent scales and orientations. Second, by segmenting curves using a bi-local property 
interesting segments can be found consistently despite imperfect edgel data. 

Bitangent points are found on each chain using the method described in [17]. Since 
each pair of bitangent points defines a curve which is a sub-section of the chain, there 
may be multiple decompositions of the chain into curved sections as shown in figure 1(b). 
In practice, many curve segments are straight lines (within a threshold for noise) and 
these are discarded as they are intrinsically less informative than curves. In addition, the 
entire chain is also used, so retaining convex curve portions. 

2.4 Feature Representation 

The feature detectors give patches and curves of interest within each image. In order 
to use them in our model their properties are parametrized to form D = [A, G] where 
A is the appearance of the regions within the image, and G is the shape of the curves 
within each image. 



Region representation. As in [10], once the regions are identified, they are cropped 
from the image and rescaled to a smaller, 11x11 pixel patch. The dimensionality is 
then reduced using principal component analysis (PCA). In the learning stage, patches 
from all images are collected and PCA performed on them. Each patch’s appearance is 
then a vector of the coordinates within the first 15 principal components, so giving A. 
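
A minimal sketch of this step, assuming the patches have already been cropped and rescaled to 11x11 pixels, is given below. PCA is computed here through an SVD of the centred data, which is one standard way of doing it and not necessarily the authors' implementation; the function names are illustrative.

```python
import numpy as np

def fit_pca(patches, n_components=15):
    """Learn a PCA basis from patches of shape (n_patches, 11, 11).

    Returns the mean patch (flattened) and the first n_components principal
    directions, one per row.
    """
    x = np.asarray(patches, dtype=float).reshape(len(patches), -1)
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(patches, mean, basis):
    """Coordinates of each patch within the learnt principal components."""
    x = np.asarray(patches, dtype=float).reshape(len(patches), -1)
    return (x - mean) @ basis.T
```

Calling `fit_pca` once on the patches pooled from all training images and then `project` on the patches of any image yields one 15-vector per region, which together form A.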

Curve representation. Each curve is transformed to a canonical position using a sim- 
ilarity transformation such that it starts at the origin and ends at the point (1,0). If the 
curve’s centroid is below the x-axis then it is flipped both in the x-axis and the line 
x = 0.5, so that the same curve is obtained independent of the edgel ordering. The y 
value of the curve in this canonical position is sampled at 13 equally spaced x inter- 
vals between (0, 0) and (1, 0). Figure 1(c) shows curve segments within this canonical 
space. Since the model is not orientation-invariant, the original orientation of the curve 
is concatenated to the 13-vector for each curve, giving a 15-vector (for robustness, ori- 
entation is represented as a normalized 2-vector). Combining the 15-vectors from all 
curves within the image gives G. 
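
The canonical representation can be sketched as below. Several details are assumptions made for illustration: the curve is given as an ordered array of (x, y) points with distinct end points, its centroid is approximated by the mean of those points, the flip is implemented as the map (x, y) -> (1 - x, -y) (which is what reversing the edgel ordering produces), the y values are obtained by linear interpolation after sorting by x, and the orientation 2-vector is the unit chord direction.

```python
import numpy as np

def curve_descriptor(points, n_samples=13):
    """15-vector for a curve: 13 canonical y samples plus a unit orientation."""
    pts = np.asarray(points, dtype=float)
    z = pts[:, 0] + 1j * pts[:, 1]           # treat points as complex numbers
    chord = z[-1] - z[0]                     # assumed non-zero
    w = (z - z[0]) / chord                   # similarity: start -> 0, end -> 1
    direction = chord / abs(chord)           # original orientation, unit length
    if w.imag.mean() < 0:                    # centroid below the x-axis
        w = 1 - w                            # (x, y) -> (1 - x, -y)
        direction = -direction               # same as reversing the ordering
    order = np.argsort(w.real, kind="stable")
    xs = np.linspace(0.0, 1.0, n_samples)
    ys = np.interp(xs, w.real[order], w.imag[order])
    return np.concatenate([ys, [direction.real, direction.imag]])
```

With this convention the descriptor is unchanged when the point order is reversed, and the first 13 entries are unchanged under translation, rotation and uniform scaling of the curve.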

2.5 Model Structure and Representation 

The descriptors are modelled by the $p(D|X,S,h,\theta)$ likelihood term. Each part models 
either curves or patches, and this allocation is made beforehand. The hypothesis h picks a 
feature for each part from A or G (as appropriate), and its description is then modelled by a 
15 dimensional Gaussian (note that both curves and patches are represented by a 15-vector). 
This Gaussian will hopefully find a cluster of curves/patches close together in the space, 
corresponding to similar looking curves or patches across images. The relative locations of 
the model parts are modelled by $p(X|S,h,\theta)$, which is a joint Gaussian density over all 
parts. Again, h allocates a feature to each part. The location of a curve is taken as its 
centroid. The location of a patch is its region centre. For the relative scale term, 
$p(S|h,\theta)$, again a Gaussian, the length of a curve and the radius of a patch region are 
taken as being the scale for a curve/patch. 

3 Method 

In this section the experimental implementation is described: the gathering of images, 
feature detection, model learning and ranking. The process will be demonstrated on the 
“bottles” category. 

3.1 Image Collection 

For a given keyword, Google’s image search 1 was used to download a set of images. 
Images outside a reasonable size range (between 100 and 600 pixels on the major axis) 
were discarded. A typical image search returned in the region of 450-700 usable images. 
A script was used to automate the procedure. For assessment purposes, the images 
returned were divided into 3 distinct groups (see fig. 2): 

1. Good images: these are good examples of the keyword category, lacking major 
occlusion, although there may be a variety of viewpoints, scalings and orientations. 

2. Intermediate images: these are in some way related to the keyword category, but 
are of lower quality than the good images. They may have extensive occlusion; 
substantial image noise; be a caricature or cartoon of the category; or the category 
is rather insignificant in the image, or some other fault. 

3. Junk images: these are totally unrelated to the keyword category. 

Additionally, a dataset consisting entirely of junk images was collected, by using the 
keyword “things”. This background dataset is used in the unsupervised learning proce- 
dure. 

The algorithm was evaluated on ten datasets gathered from Google: bottles, camel, 
cars, coca cola, horses, leopards, motorbike, mugs, tiger and zebra. It is worth noting 
that the inclusion or exclusion of an “s” to the keyword can make a big difference to the 
images returned. The datasets are detailed in Table 1. 

Table 1. Statistics of the datasets as returned by Google. 

Dataset                Bottles  Camel  Cars  Coca-cola  Horses  Leopards  Motorbike  Mugs  Tiger  Zebra  Things
Total size of dataset      700    700   448        500     600       700        500   600    642    640     724
% Good images               41     24    30         17      21        49         25    50     35     44    n.a.
% Intermediate images       26     27    18         12      25        33         16     9     24     33    n.a.
% Junk images               33     49    52         71      54        18         59    41     41     24    n.a.



3.2 Image Re-ranking 

Feature detection. Each image is converted to greyscale, since colour information is 
not used in the model. Curves and regions of interest are then found within the image, 
using exactly the same settings for all datasets. This produces X, D and S for use in 
learning or recognition. The 25 regions with the highest saliency, and 30 curves with the 
longest length are used from each image. 
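
The selection rule in the last sentence amounts to two top-k sorts. In the sketch below the dictionary keys `saliency` and `length`, and the function name, are hypothetical; any feature record carrying those two quantities would do.

```python
def select_features(regions, curves, n_regions=25, n_curves=30):
    """Keep the most salient regions and the longest curves of one image.

    regions: list of dicts with a 'saliency' entry (hypothetical layout).
    curves:  list of dicts with a 'length' entry (hypothetical layout).
    """
    top_regions = sorted(regions, key=lambda r: r["saliency"], reverse=True)
    top_curves = sorted(curves, key=lambda c: c["length"], reverse=True)
    return top_regions[:n_regions], top_curves[:n_curves]
```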

1 http://www.google.com/imghp. Date of collection: Jan. 2003. As we write (Feb. 2004) we 
notice that Google’s precision-recall curves have improved during the last 12 months. 

Fig. 2. Images of bottles. (a) The first 25 images returned by Google. The coloured dot in the bottom 
right hand corner indicates the ground truth category of the image: good (green); intermediate 
(yellow) or junk (red). (b) The 10 hand selected images used in the supervised experiments. 



Model Learning. The learning process takes one of two distinct forms: unsupervised 
learning and limited supervision: 

- Unsupervised learning: In this scenario, a model is learnt using all images in the 
dataset. No human intervention is required in the process. 

- Learning with limited supervision: An alternative approach is to use relevance- 
feedback. The user picks 10 or so images that are close to the image he/she wants, 
see figure 2(b) for examples for the bottles category. A model is learnt using these 
images. 

In both approaches, the learning task takes the form of estimating the parameters $\theta$ 
of the model discussed above. The goal is to find the parameters $\theta_{ML}$ which best explain 
the data X, D, S from the chosen training images (be it 10 or the whole dataset), i.e. 
maximise the likelihood: $\theta_{ML} = \arg\max_{\theta} p(X, D, S|\theta)$. For the 5 part model used 
in the experiments, there are 243 parameters. In the supervised learning case, the use of 
only 10 training images is a compromise between the number the user can be expected 
to pick and the generalisation ability of the model. The model is learnt using the EM 
algorithm as described in [10]. Figure 3 shows a curve model and a patch model trained 
from the 10 manually selected images of bottles. 

Re-ranking. Given the learnt model, the likelihood ratio (eqn. 1) for each image is 
computed. This likelihood ratio is then used to rank all the images in the dataset. Note 
that in the supervised case, the 10 images manually selected are excluded from the 
ranking. 
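
The re-ranking step can be sketched as a sort on the likelihood ratio. Working with log-likelihoods is a choice made here for numerical convenience (it leaves the ordering unchanged); the function name and its arguments are illustrative.

```python
import numpy as np

def rerank(log_lik_object, log_lik_background, training_ids=()):
    """Return image indices ordered by decreasing log likelihood ratio.

    log_lik_object[i] and log_lik_background[i] are assumed to hold
    log p(X,S,D | theta) and log p(X,S,D | theta_bg) for image i.
    Indices listed in training_ids are left out of the ranking.
    """
    log_ratio = (np.asarray(log_lik_object, dtype=float)
                 - np.asarray(log_lik_background, dtype=float))
    order = np.argsort(-log_ratio, kind="stable")
    excluded = set(training_ids)
    return [int(i) for i in order if int(i) not in excluded]
```

In the supervised case the 10 hand-picked images would be passed as `training_ids`; in the unsupervised case the argument is left empty.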

Fig. 3. Models of bottles. (a) & (b): Curve model. (c) & (d): Patch model. (a) The spatial layout 
of the curve model with mean curves overlaid. The X and Y axes are in arbitrary units since the 
model is scale-invariant. The ellipses indicate variance in relative location. (b) Patch of images 
selected by curve features from high scoring hypotheses. (c) Spatial layout for patch model. (d) 
Sample patches closest to mean of appearance density. Both models pick out bottle necks and 
bodies with the shape model capturing the side-by-side arrangement of the bottles. 



Speed considerations. If this algorithm is to be of practical value, it must be fast. Once 
images have been preprocessed, which can be done off-line, a model can be learnt from 
10 images in around 45 seconds and the images in the dataset re-ranked in 4-5 seconds 
on a 2 GHz processor. 

3.3 Robust Learning in the Unsupervised Case 

We are attempting to learn a model from a dataset which contains valid data (the good 
images) but also outliers (the intermediate and junk images), a situation faced in the area 
of robust statistics. One approach would be to use all images for training and rely on the 
model’s occlusion term to account for the small portion of valid data. However, this 
requires an accurate modelling of image clutter properties and reliable convergence during 
learning. As an alternative approach, we adapt a robust fitting algorithm, RANSAC [11], 
to our needs. A large number of models are trained (~100), each one using a set of 
randomly drawn images sufficient to train a model (10 in this case). The intuition is that at 
least one of these will be trained on a higher than average proportion of good images, so 
will be a good classifier. The challenge is to find a robust unsupervised scoring function 
that is highly correlated to the underlying classification performance. The model with 
the highest score is then picked as the model to perform the re-ranking of the dataset. 
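
The sampling loop described in this paragraph might look as follows. The callables `train_fn` (fits a model to a list of image identifiers) and `score_fn` (returns the unsupervised score of a model) are placeholders for the EM training and the scoring function, neither of which is implemented here.

```python
import random

def ransac_select(image_ids, train_fn, score_fn,
                  n_models=100, sample_size=10, seed=0):
    """Train many models on random subsets and keep the best-scoring one."""
    rng = random.Random(seed)
    pool = list(image_ids)
    best_model, best_score = None, float("-inf")
    for _ in range(n_models):
        subset = rng.sample(pool, sample_size)   # drawn without replacement
        model = train_fn(subset)
        score = score_fn(model)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```

Unlike classical RANSAC, which usually counts inliers, the selection here relies entirely on the externally supplied score.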

Our novel scoring approach uses a second set of images, consisting entirely of irrel- 
evant images, the aforementioned background dataset. Thus there are now two datasets: 
(a) the one to be ranked (consisting of a mixture of junk and good images) and (b) the 
background dataset. Each model evaluates the likelihood of images from both datasets 
and a differential ranking measure is computed between them. In this instance, we com- 
pute the area under a recall-precision curve (RPC) between the two datasets. In our 
experiments we found a good correlation between this measure and the ground truth 
RPC precision: the final model picked was consistently in the top 15% of models, as 
demonstrated in figs. 4(c) & (d). 
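
One plausible reading of this scoring function is sketched below: images of the dataset to be ranked are treated as positives, background images as negatives, and the area under the resulting recall-precision curve is returned. The exact definition used in the experiments is not given in the text, so the trapezoidal integration and the handling of the curve's starting point are assumptions.

```python
import numpy as np

def rpc_area(scores_dataset, scores_background):
    """Area under the recall-precision curve separating two score sets."""
    scores = np.concatenate([scores_dataset, scores_background]).astype(float)
    labels = np.concatenate([np.ones(len(scores_dataset)),
                             np.zeros(len(scores_background))])
    order = np.argsort(-scores, kind="stable")       # best score first
    hits = np.cumsum(labels[order])
    precision = hits / np.arange(1, len(scores) + 1)
    recall = hits / len(scores_dataset)
    # Start the curve at recall 0 using the precision of the top-ranked image.
    recall = np.concatenate([[0.0], recall])
    precision = np.concatenate([[precision[0]], precision])
    # Trapezoidal integration of precision over recall.
    widths = np.diff(recall)
    heights = (precision[1:] + precision[:-1]) / 2
    return float(np.sum(widths * heights))
```

A model whose likelihood ratios place every dataset image above every background image scores 1; the more the two sets interleave, the lower the area.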

Fig. 4. (a) & (b) Recall-Precision curves computed using ground truth for the supervised models 
in figure 3. In (a), the good images form the positive set and the intermediate and junk images 
form the negative one. In (b), good and intermediate images form the positive set and junk images, 
the negative one. The dotted blue line is the curve of the raw Google images (i.e. taken in the 
order they are presented to the user). The solid red line shows the performance of the curve model 
and the dashed green line shows the performance of the patch model. As most users will only 
look at the first few pages of returned results, the interesting area of the plots is the left-hand 
side of the graph, particularly around a recall of 0.15 (as indicated by the vertical line). In this 
region, the curve model clearly gives an improvement over both the raw images and the patch 
model (as predicted by the variance measure). (c) & (d): Scatter plots showing the scoring RPC 
area versus ground truth RPC area for curve and patch models respectively in the unsupervised 
learning procedure. Each point is a model learnt using the RANSAC-style unsupervised learning 
algorithm. The model selected for each feature type is indicated by the red circle. Note that in 
both plots it is amongst the best few models. 



3.4 Selection of Feature Type 

For each dataset in both the supervised and unsupervised case, two different models 
are learnt: one using only patches and another using only curves. A decision must be 
made as to which model should give the final ranking that will be presented to the 
user. This is a challenging problem since the models exist in different spaces, so their 
likelihoods cannot be directly compared. Our solution is to compare the variance of the 
unsupervised models’ scoring function. If a feature type is effective then a large variance 
is expected since a good model will score much better than a mediocre one. However, 
an inappropriate feature type will be unable to separate the data effectively, no matter 
which training images were used, meaning all scores will be similar. 

Using this approach, the ratio of the variance of the RANSAC curve and patch 
models is compared to a threshold (fixed for all datasets) and a selection of feature type 
is made. This selection is then used for both the unsupervised and supervised learning 
cases. Figure 5 shows the first few re-ranked images of the bottles dataset, using the 
model chosen - in this case, curves. 
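
The selection rule can be written as a one-line comparison. The direction of the ratio (curves over patches) and the default threshold of 1.0 are assumptions for illustration; the text only states that a fixed threshold is used for all datasets.

```python
import numpy as np

def select_feature_type(curve_scores, patch_scores, threshold=1.0):
    """Choose a feature type from the spread of the RANSAC model scores.

    curve_scores and patch_scores hold the unsupervised score of every
    RANSAC-trained curve model and patch model respectively.
    """
    ratio = np.var(curve_scores) / np.var(patch_scores)
    return "curves" if ratio > threshold else "patches"
```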

4 Results 

Two series of experiments were performed: the first used the supervised learning method 
while the second was completely unsupervised. In both sets, the choice between curves 
and patches was made automatically. The results of the experiments are summarised in 
table 2. 

Fig. 5. Re-ranked bottle images. The dot in the bottom right corner shows the label of the image. 
The thin magenta curves on each image show the curve segments detected. The best hypothesis 
is also highlighted with thick coloured lines. The duplicate images present in the dataset are the 
reason that some of the 10 training images appear in the figure. Notice that the model seems to 
pick up the neck of the bottles, with its distinctive curvature. These images clearly contain more 
bottles than those of figure 2. 



Table 2. Summary of results: Precision at 15% recall - equivalent to around two web-pages 
worth of images. Good images vs. intermediate & junk. The second row gives raw Google output 
precision. Rows 3 & 4 give results of supervised learning, using 10 handpicked images. Rows 5 
& 6 give results of unsupervised RANSAC-style learning. Rows 7 & 8 are included to show the 
comparison of the RANSAC approach to unsupervised learning using all images in the dataset. 
Bold indicates the automatically selected model. For the forms of learning used (supervised and 
RANSAC-style unsupervised), this model selection is correct 90% of the time. The final column 
gives the average precision across all datasets, for the automatically chosen feature type. 



Dataset                          Bottles  Camel  Cars  Coca-cola  Horses  Leopards  Motorbike  Mugs  Tiger  Zebra  Average
Raw Google                          39.3   36.1  31.7       41.9    31.1      46.8       48.7  84.9   30.5   51.9     44.3
10 images (Curves)                  82.9   80.0  78.3       35.3    28.3      39.5       48.6  75.0   43.8   74.1     65.9
10 images (Patches)                 52.3   68.6  47.4       54.5    23.6      69.0       42.5  55.7   72.7   74.1
RANSAC unsupervised-Curves          81.4   78.8  69.0       29.5    25.0      41.5       61.3  68.2   43.4   71.2     58.9
RANSAC unsupervised-Patches         68.6   48.7  42.6       26.0    25.0      50.0       20.4  66.7   58.9   54.5
All images unsupervised-Curves      76.1   81.2  41.7       43.3    23.2      51.0       34.5  76.3   44.0   64.6     52.9
All images unsupervised-Patches     35.0   27.4  44.4       23.6    22.4      55.4       17.9  62.5   53.2   50.0
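
The figure of merit used in Table 2, precision at 15% recall, can be computed from a ranked list of ground-truth labels as sketched below. The function name and the 0/1 label encoding (1 for a good image, 0 for intermediate or junk) are assumptions.

```python
def precision_at_recall(ranked_labels, recall_level=0.15):
    """Precision at the first rank where recall reaches recall_level.

    ranked_labels lists 1 (good) or 0 (not good) in ranked order, best first.
    """
    total_good = sum(ranked_labels)
    if total_good == 0:
        raise ValueError("no good images in the ranking")
    hits = 0
    for rank, is_good in enumerate(ranked_labels, start=1):
        hits += is_good
        if hits / total_good >= recall_level:
            return hits / rank
    return hits / len(ranked_labels)
```

For example, if the dataset holds 20 good images and the third of them appears at rank 4, recall reaches 15% there and the precision is 3/4.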



4.1 Supervised Learning 

The results in table 2 show that the algorithm gives a marked improvement over the raw 
Google output in 7 of the 10 datasets. The evaluation is a stringent one, since the model 
must separate the good images from the intermediate and junk, rather than just separating 
the good from the junk. The curve features were used in 6 instances, as compared to 
4 for patches. While curves would be expected to be preferable for categories such as 
bottles, their marked superiority on the cars category, for example, is surprising. It can 
be explained by the large variation in viewpoint present in the images. No patch features 
could be found that were stable across all views, whereas long horizontal curves in close 
proximity were present, regardless of the viewpoint and these were used by the model, 
giving a good performance. Another example of curves being unexpectedly effective is 
the camel dataset, shown in figure 6. Here, the knobbly knees and legs of the camel 
are found consistently, regardless of viewpoint and clutter, so are used by the model to 
give a precision (at 15% recall) over twice that of the raw Google images. The failure to 
improve Google’s output on 3 of the categories (horses, motorbikes and mugs) can be 
mainly attributed to an inability to obtain informative features on the object. It is worth 
noting that in these cases, either the raw Google performance was very good (mugs) or 
the portion of good images was very small (<25%). 

Fig. 6. Camel. The algorithm performs well, even in the unsupervised scenario. The curve model, 
somewhat surprisingly, locks onto the long, gangly legs of the camel. From the RPC (good vs 
intermediate & junk), we see that for low recall (the first few web-pages returned), both the models 
have around double the precision of the raw Google images. 
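
The figure of merit used in table 2, precision at a fixed recall level, can be computed from any ranked list along the following lines. This is only a minimal sketch, not the authors' evaluation code: the function name `precision_at_recall`, the toy rankings, and the convention of reading off precision at the first rank where the target recall is reached are all assumptions made for illustration.

```python
def precision_at_recall(ranked_labels, recall_level):
    """Precision at the first rank where recall reaches `recall_level`.

    ranked_labels: booleans in ranked order; True marks a 'good' image,
    False an 'intermediate' or 'junk' image (assumed labelling).
    """
    total_good = sum(ranked_labels)
    if total_good == 0:
        raise ValueError("no good images in the ranking")
    found = 0
    for rank, is_good in enumerate(ranked_labels, start=1):
        found += is_good
        if found / total_good >= recall_level:
            return found / rank
    return found / len(ranked_labels)  # only reached if recall_level > 1


# Toy rankings (hypothetical): the same 10 images before and after re-ranking.
raw = [False, True, False, False, True, False, True, False, False, True]
reranked = [True, True, False, True, False, False, True, False, False, False]

print(precision_at_recall(raw, 0.5))
print(precision_at_recall(reranked, 0.5))
```

In this toy case the re-ranked list reaches 50% recall after two images while the raw list needs five, which is the kind of gap the table summarises at 15% recall.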

4.2 Unsupervised Learning 

In this approach, 6 of the 10 cases were significantly better than the raw Google output. 
Many of them were only slightly worse than the supervised case, with the motorbike 
category actually superior. This category is shown in figure 7. 

In table 2, RANSAC-style learning is compared to learning directly from all images 
in the dataset. The proportion of junk images in the dataset determines which of the 
two approaches is superior: using all images is marginally better when the proportion is 
small, while the RANSAC approach is decisively better with a large proportion of junk. 



5 Discussion and Future Work 

Reranking Google images based on their similarity is a problem that is similar to classical 
visual object recognition. However, it is worth noting the significant differences. In the 
classical setting of visual recognition we are handed a clean training set consisting 
of carefully labelled ‘positive’ and ‘negative’ examples; we are then asked to test our 
algorithm on fresh data that was collected independently. In the present scenario the 
training set is not labelled; it contains a minority (20-50%) of ‘good’ examples, and a 
majority of either ‘intermediate’ or ‘junk’ examples. Moreover, after learning, our task 
is to sort the ‘training’ set, rather than work on fresh data. 

Selecting amongst models composed of heterogeneous features is a difficult chal- 
lenge in our setting. If we had the luxury of a clean labelled training set, then part of this 
could have been selected as a validation set and then used to select between all-curve and 
all-patch models. Indeed we could then have trained heterogeneous models where parts 
could be either curves or patches. However, the non-parametric RPC scoring methods 
developed here are not up to this task. 

It is clear that the current features used are somewhat limited in that they capture 
only a small fraction of the information from each image. In some of the datasets (e.g. 
horses) the features did not pick out the distinctive information of the category at all, 
so the model had no signal to deal with and the algorithm failed as a consequence. By 
introducing a wider range of feature types (e.g. corners, texture) a wider range of datasets 
should be accessible to the algorithm. 

Overall, we have shown that in the cases where the model’s features (patches and 
curves) are suitable for the object class, then there is a marked improvement in the 
ranking. Thus we can conclude that the conjecture of the introduction is valid - visual 
consistency ranking is a viable visual category filter for these datasets. 

There are a number of interesting issues in machine learning and machine vision 
that emerge from our experience: (a) Priors were not used in either of the learning 
scenarios. In Fei-Fei et al. [8] priors were incorporated into the learning process of the 
constellation model, enabling effective models to be trained from a few images. Applying 
these techniques should enhance the performance of our algorithm. (b) The ‘supervised’ 
case could be improved by using the small labelled training set provided by the user 
simultaneously with the large unlabelled original dataset. Machine learning researchers 
are making progress on the problem of learning from ‘partially labeled’ data. We ought 
to benefit from that effort. 

Fig. 7. Motorbike. The top scoring unsupervised motorbike model, selected automatically. The 
model picks up on the wheels of the bike, despite a wide range of viewpoints and clutter. The 
RPC (good vs intermediate & junk) shows the curve model performing better than Google’s raw 
output and the model based on patches (which is actually worse than the raw output). 



Abstract. We propose a solution to the problem of inferring the depth 
map, radiance and motion of a scene from a collection of motion- 
blurred and defocused images. We model motion-blur and defocus as 
an anisotropic diffusion process, whose initial conditions depend on the 
radiance and whose diffusion tensor encodes the shape of the scene, the 
motion field and the optics parameters. We show that this model is well- 
posed and propose an efficient algorithm to infer the unknowns of the 
model. Inference is performed by minimizing the discrepancy between the 
measured blurred images and the ones synthesized via forward diffusion. 
Since the inverse problem is ill-posed, we also introduce additional Tikhonov 
regularization terms. The resulting method is fast and robust to noise as 
shown by experiments with both synthetic and real data. 



1 Introduction 

We consider the problem of recovering the motion, depth map and radiance of 
a scene from a collection of defocused and motion-blurred images. Defocus is 
commonly encountered when using cameras with a finite aperture lens, while 
motion-blur is common when the imaging system is moving. To the best of 
our knowledge, we are the first to address the above problem. Typically, this 
problem is approached by considering images that are affected either by defocus 
or by motion-blur alone. The first case is divided into two fields of research 
depending on which object one wants to recover. When we are interested in 
recovering the radiance from defocused (and possibly downsampled) images, we 
are solving a super-resolution problem [2]. If we are interested in recovering 
the depth map of the scene (and possibly the radiance), then we are solving 
the so-called problem of shape from defocus [8,12,15,17,19,6]. The second case 
corresponds to the problem of motion deblurring, where one is mainly interested 
in reconstructing the radiance, which can be thought of as the unblurred or ideal 
image, of a scene under the assumptions of Lambertian reflection and uniform 
illumination [3,4,14]. Motion deblurring is a problem of blind deconvolution [5] 
or blind image restoration [21], and, therefore, is related to a large body of 
literature [20]. 



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3021, pp. 257-269, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 

1.1 Contributions of This Paper 

The contribution of this paper is twofold: to link the estimation of the depth 
map of a scene to the recovery of the radiance and to introduce a simple and 
computationally efficient imaging model for images that are both defocused and 
motion-blurred. We model motion-blur via the depth map of the scene and the 
rigid motion of the camera, which requires at most 6 scalar numbers for 2 images 
(see section 2.2). This model avoids the artifacts of employing oversimplified mo- 
tion models (e.g. each point on the image plane moves with the same constant 
velocity) and yields better estimates than motion models where the motion field 
is completely unconstrained, due to its lower dimensionality. The second con- 
tribution of this paper is the introduction of a novel model for defocused and 
motion-blurred images in the framework of anisotropic diffusion, in the spirit of 
[10]. The literature on anisotropic diffusion is quite substantial and, therefore, 
this work relates also to [13,18,16]. 

We pose the inference problem as the minimization of the discrepancy between 
the data and the model (i.e. the final value of the anisotropic diffusion for 
several different focal settings). The problem is ill-posed: it consists in finding 
a diffusion tensor and an unknown initial value from final values of parabolic 
equations. For this reason we introduce Tikhonov-type regularization, which also 
remedies an unwanted effect with respect to motion-blur, where a local minimum 
would be attained for zero motion in the absence of suitable regularization (see 
section 3). 

2 A General Model for Defocus and Motion-Blur 

2.1 An Imaging Model for Space- Varying Defocus 

Images captured with a camera are measurements of energy emitted from the 
scene. We represent an image with a function $J : \Omega \subset \mathbb{R}^2 \to [0,\infty)$ that maps 
pixels on the image plane to energy values. We assume that $\Omega$ is a bounded 
domain with piecewise smooth boundary $\partial\Omega$. The intensity of the measured 
energy depends on the distance of the objects in the scene from the camera and 
the reflectance properties of their surfaces. We describe the surfaces of the objects 
with a function $s : \mathbb{R}^2 \to [0,\infty)$, and the reflectance with another function 
$r : \mathbb{R}^2 \to [0,\infty)$; $s$ assigns a depth value to each pixel coordinate and is called the 
depth map. Similarly, $r$ assigns an energy value to each point on the depth map 
$s$ and is called, with an abuse of terminology¹, radiance. Furthermore, we 
usually know lower and upper bounds $0 < s_{\min} < s_{\max}$ for the depth map $s$, 
which we may incorporate as an additional inequality constraint of the form 

$$ s_{\min} \le s(x) \le s_{\max}, \quad \forall x \in \Omega. \tag{1} $$

¹ In the context of radiometry, the term radiance refers to a more complex object that 
describes energy emitted along a certain direction, per solid angle, per foreshortened 
area and per time instant. However, in our case, since we do not change vantage 
point and the size of the optics and the CCD are considerably smaller than the size 
of the scene, each pixel will collect energy mostly from a single direction, and the 
change in the solid angle between different pixels is approximately negligible. Hence, 
a function of the position on the surface of the scene, which is the one we use, suffices 
to describe the variability of the radiance. 



The energy measured by an image $J$ also depends on the optics of the camera. 
We assume the optics can be characterized by a function $h : \Omega \times \mathbb{R}^2 \to [0,\infty)$, 
the so-called point spread function (PSF), so that an image $J$ can be modeled 
by 

$$ J(y) = \int h(y,x)\, r(x)\, dx. \tag{2} $$

Although we did not write it explicitly, the PSF $h$ depends on the surface $s$ and 
the parameters of the optics (see section 3 for more details). 

Under the assumption that the PSF is Gaussian and that the surface $s$ is smooth, 
we can substitute the above model with a PDE whose solution $u : \mathbb{R}^2 \times [0,\infty) \to \mathbb{R}$, 
$(x,t) \mapsto u(x,t)$, at each time $t$ represents an image with a certain amount 
of blurring. In formulas, we have that $J(y) = u(y,T)$, where $T$ is related to the 
amount of blurring of $J$. We use the following anisotropic diffusion equation: 

$$
\begin{cases}
\dot{u}(y,t) = \nabla \cdot \left( D(y) \nabla u(y,t) \right) & t > 0 \\
u(y,0) = r(y) & \forall y \in \Omega \\
D(y) \nabla u(y,t) \cdot n = 0
\end{cases}
\tag{3}
$$

where $D = \begin{bmatrix} d_{11} & d_{12} \\ d_{21} & d_{22} \end{bmatrix}$, with $d_{ij} : \mathbb{R}^2 \to \mathbb{R}$ for $i,j = 1,2$ and $d_{12} = d_{21}$, is called 
diffusion tensor. We assume that $d_{ij} \in C^1(\mathbb{R}^2)$ (i.e. the space of functions with 
continuous partial derivatives in $\mathbb{R}^2$) for $i,j = 1,2$, and² $D(y) \ge 0$ $\forall y \in \mathbb{R}^2$. 
The symbol $\nabla$ is the gradient operator $\left[ \frac{\partial}{\partial y_1} \ \frac{\partial}{\partial y_2} \right]^T$ with $y = [y_1 \ y_2]^T$, and 
the symbol $\nabla \cdot$ is the divergence operator $\sum_{i=1}^{2} \frac{\partial}{\partial y_i}$; $n$ denotes the unit vector 
orthogonal to $\partial\Omega$. Notice that there is a scale ambiguity between the time $T$ 
and the determinant of the diffusion tensor $D$. We will set $T = \frac{1}{2}$ to resolve this 
ambiguity. 

When the depth map $s$ is a plane parallel to the image plane, the PSF $h$ is a 
Gaussian with constant covariance $\sigma^2$, and it is easy to show that $2tD = \sigma^2 I_d$, 
where $I_d$ is the $2 \times 2$ identity matrix. In particular, at time $t = T = \frac{1}{2}$ we have 
$D = \sigma^2 I_d$. This model is fairly standard and was used for instance in [10]. 
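
The relation $2tD = \sigma^2 I_d$ can be checked numerically in one dimension: diffusing a unit impulse with constant diffusivity $D$ up to time $T = 1/2$ should yield a kernel of variance $D$. The sketch below is an illustration only; the explicit finite-difference scheme, the unit grid spacing, the function names and the chosen values are assumptions and are not taken from the paper.

```python
def diffuse_1d(u0, d, t_final, dt):
    """Explicit finite differences for u_t = d * u_xx on a unit grid,
    with zero-flux (Neumann) boundaries. Illustrative scheme only."""
    a = d * dt
    if a > 0.5:
        raise ValueError("unstable step: need d * dt <= 0.5")
    steps = round(t_final / dt)
    u = list(u0)
    n = len(u)
    for _ in range(steps):
        nxt = u[:]
        for i in range(n):
            left = u[i - 1] if i > 0 else u[i]
            right = u[i + 1] if i < n - 1 else u[i]
            nxt[i] = u[i] + a * (left - 2.0 * u[i] + right)
        u = nxt
    return u


def variance(u):
    """Variance of the grid index under the (non-negative) weights u."""
    total = sum(u)
    mean = sum(i * v for i, v in enumerate(u)) / total
    return sum((i - mean) ** 2 * v for i, v in enumerate(u)) / total


sigma2 = 4.0                      # target blur variance (hypothetical value)
impulse = [0.0] * 101
impulse[50] = 1.0
blurred = diffuse_1d(impulse, d=sigma2, t_final=0.5, dt=0.05)
print(variance(blurred))          # should be close to sigma2
```

The total mass of the impulse is preserved by the zero-flux boundary condition, which mirrors the third line of eq. (3).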



2.2 An Imaging Model for Motion-Blur 



On the image plane we measure projections of three dimensional points in the 
scene. In other words, given a point $X(t) = [X_1(t) \ X_2(t) \ X_3(t)]^T \in \mathbb{R}^3$ at a time 
instant $t$, we measure 

$$ x(t) = [x_1(t) \ x_2(t)]^T = \left[ \frac{X_1(t)}{X_3(t)} \ \ \frac{X_2(t)}{X_3(t)} \right]^T. \tag{4} $$

² Since $D$ is a tensor, the notation $D(y) \ge 0$ means that $D(y)$ is positive semi-definite. 

Using the projections of the points on the image plane $x(t)$, we can write the 
coordinates of a point $X(t)$ as 

$$ X(t) = [x(t)^T \ 1]^T \, s(x(t)). \tag{5} $$

We denote with $V = [V_1(t) \ V_2(t) \ V_3(t)]^T \in \mathbb{R}^3$ the translational velocity and 
with $\omega \in \mathbb{R}^3$ the rotational velocity of the scene. Then, it is well known that the 
time derivative of the projection $x$ satisfies (see [11] for more details): 

$$
\dot{x}(t) = \frac{1}{s(x(t))}
\begin{bmatrix} 1 & 0 & -x_1(t) \\ 0 & 1 & -x_2(t) \end{bmatrix} V
+ \begin{bmatrix} -x_1(t)x_2(t) & 1 + x_1^2(t) & -x_2(t) \\ -1 - x_2^2(t) & x_1(t)x_2(t) & x_1(t) \end{bmatrix} \omega.
\tag{6}
$$

We define $v = \dot{x}(t)$ and call it the velocity field. 

As we have anticipated, we restrict ourselves to a crude motion model that 
only represents translations parallel to the image plane, i.e. 

$$ v(t) = \frac{V_{1,2}(t)}{s(x(t))}, \tag{7} $$

where $V_{1,2} = [V_1 \ V_2]^T$ is the velocity in focal length units. Now, recalling eq. (2), we have 
that $J(x + vt)$ denotes an image captured at time $t$. If the camera shutter remains 
open while moving the camera with velocity $V$ for a time interval $\Delta T$, then the 
image $I$ we measure on the image plane can be written as: 

$$ I(x) = \frac{1}{\Delta T} \int_{-\Delta T/2}^{\Delta T/2} J(x + vt)\, dt \simeq \int \frac{1}{\sqrt{2\pi\gamma^2}}\, e^{-\frac{t^2}{2\gamma^2}} J(x + vt)\, dt, \tag{8} $$

where $\gamma$ depends on the time interval $\Delta T$. The parameter $\gamma$ can be included in 
the velocity vector $v$ since there is an ambiguity between the duration of the 
integration time and the magnitude of the velocity. Therefore, we have 

$$ I(x) = \int \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^2}{2}} J(x + vt)\, dt. \tag{9} $$

For simplicity, the above model has been derived for the case of a sideways 
translational motion, but it is straightforward to extend it to the general case of 
eq. (6). 
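
Read as an expectation, eq. (9) states that the motion-blurred image is the average of $J(x + vt)$ over a standard normal variable $t$. The following one-dimensional sketch illustrates this with a quadratic test signal, for which the average has the closed form $x^2 + v^2$. The function name `motion_blur`, the trapezoidal quadrature and the test values are assumptions chosen for illustration; they are not part of the paper.

```python
import math


def motion_blur(J, x, v, t_max=8.0, n=4001):
    """Approximate I(x) = integral of N(t; 0, 1) * J(x + v * t) dt
    with the trapezoidal rule on [-t_max, t_max] (illustrative only)."""
    h = 2.0 * t_max / (n - 1)
    total = 0.0
    for k in range(n):
        t = -t_max + k * h
        weight = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        end_factor = 0.5 if k in (0, n - 1) else 1.0
        total += end_factor * weight * J(x + v * t)
    return total * h


def J(x):
    """Hypothetical 1-D 'image': a quadratic intensity profile."""
    return x * x


# For this J the Gaussian average is x**2 + v**2, i.e. 2.25 + 0.64 here.
print(motion_blur(J, x=1.5, v=0.8))
```

Setting `v = 0` returns `J(x)` itself, which corresponds to the absence of motion-blur.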



2.3 Modeling Motion-Blur and Defocus Simultaneously 

In this section, we consider images where defocus and motion-blur occur 
simultaneously. In the presence of motion, a defocused image $J$ measured at time $t$ 
can be expressed as 

$$ J(y + vt) = \int h(y + vt, x)\, r(x)\, dx. \tag{10} $$

Following eq. (9), we obtain 

$$ I(y) = \int \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^2}{2}} \int \frac{1}{2\pi\sigma^2}\, e^{-\frac{(y - x + vt)^T (y - x + vt)}{2\sigma^2}}\, r(x)\, dx\, dt. \tag{11} $$

If we now interchange the integration order, we can write the previous equation 
in a more compact way as 

$$ I(y) = \int \frac{1}{2\pi\sqrt{|C|}}\, e^{-\frac{(y - x)^T C^{-1} (y - x)}{2}}\, r(x)\, dx, \tag{12} $$

where $C = \sigma^2 I_d + v v^T$. 

Eq. (12) is also the solution of the anisotropic diffusion PDE (3) with initial 
condition the radiance $r$ and diffusion tensor $D = \frac{C}{2t}$. Hence, a model for defocused 
and motion-blurred images is the following: 

$$
\begin{cases}
\dot{u}(y,t) = \nabla \cdot \left( D \nabla u(y,t) \right) & t > 0 \\
u(y,0) = r(y) & \forall y \in \Omega \\
D \nabla u(y,t) \cdot n = 0
\end{cases}
\tag{13}
$$

where at time $t = T = \frac{1}{2}$, $D = C = \sigma^2 I_d + v v^T$. Now, it is straightforward to 
extend the model to the space-varying case, and have that 

$$ D(y) = \sigma^2(y) I_d + v(y) v(y)^T. \tag{14} $$

In particular, when eq. (7) is satisfied, we have 

$$ D(y) = \sigma^2(y) I_d + \frac{V_{1,2} V_{1,2}^T}{s^2(y)}. \tag{15} $$

Notice that the diffusion tensor just defined is made of two terms: $\sigma^2(y) I_d$ and 
$\frac{V_{1,2} V_{1,2}^T}{s^2(y)}$. The first term corresponds to the isotropic component of the tensor, and 
captures defocus. The second term corresponds to the anisotropic component of 
the tensor, and it captures motion-blur. Furthermore, since both of the terms 
are guaranteed to be always positive semi-definite, the tensor eq. (15) is positive 
semi-definite too. We will use eq. (13) together with eq. (15) as our imaging 
model in all subsequent sections. 
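
The structure of the tensor in eq. (15) is easy to inspect numerically: its smallest eigenvalue is the defocus variance $\sigma^2$ and its largest is $\sigma^2 + \|V_{1,2}\|^2 / s^2$, attained along the motion direction. The sketch below builds the tensor at a single pixel; the function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def diffusion_tensor(sigma2, V, s):
    """D = sigma^2 * I + V V^T / s^2 for a 2-vector V and a depth s (eq. (15))."""
    vx, vy = V[0] / s, V[1] / s
    return [[sigma2 + vx * vx, vx * vy],
            [vx * vy, sigma2 + vy * vy]]


def eigenvalues_sym2(D):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix."""
    tr = D[0][0] + D[1][1]
    det = D[0][0] * D[1][1] - D[0][1] * D[1][0]
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return tr / 2.0 - disc, tr / 2.0 + disc


# Hypothetical values at one pixel: defocus variance, velocity and depth.
sigma2, V, s = 1.0, (0.6, 0.8), 2.0
D = diffusion_tensor(sigma2, V, s)
lam_min, lam_max = eigenvalues_sym2(D)
print(D)
print(lam_min, lam_max)   # sigma2 and sigma2 + |V|^2 / s^2
```

Both eigenvalues are non-negative whenever $\sigma^2 \ge 0$, in line with the positive semi-definiteness noted above.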

2.4 Well-Posedness of the Diffusion Model 

A first step in the mathematical analysis is to verify the well-definedness of 
the parameter-to-output map $(r, s, V_{1,2}) \mapsto u(\cdot, T)$, which corresponds to a 
well-posedness result for the degenerate parabolic initial-boundary value problem 

$$
\begin{cases}
\dot{u}(y,t) = \nabla \cdot \left( D(y) \nabla u(y,t) \right) & t > 0 \\
u(y,0) = r(y) \\
D(y) \nabla u(y,t) \cdot n = 0
\end{cases}
\tag{16}
$$

for diffusion tensors of the form $D(y) = \sigma(y)^2 I_d + \frac{V_{1,2} V_{1,2}^T}{s(y)^2}$, where $n$ denotes the unit 
vector orthogonal to the boundary of $\Omega$. The following theorem guarantees the 
existence of weak solutions for the direct problem: 

Theorem 1. Let $r \in L^2(\Omega)$ and let $s \in H^1(\Omega)$ satisfy (1). Then, there exists a 
unique weak solution $u \in C(0,T; L^2(\Omega))$ of (16), satisfying 

$$ \int_0^T \!\! \int_\Omega \lambda(y)\, |\nabla u(y,t)|^2 \, dy\, dt \le \int_\Omega r(y)^2 \, dy, \tag{17} $$

where $\lambda(y) \ge 0$ denotes the minimal eigenvalue of $D(y)$. 

Proof. See technical report [9]. 



3 Estimating Radiance, Depth, and Motion 



In section 2.1 we introduced the variance $\sigma^2$ of the PSF $h$ to model defocus. The 
variance $\sigma^2$ depends on the depth map via 

$$ \sigma^2(x) = \left( \frac{d}{2} \right)^2 \left( 1 - p \left( \frac{1}{F} - \frac{1}{s(x)} \right) \right)^2, $$

where $d$ is the aperture of the camera (in pixel units), $p$ is the distance between 
the image plane and the lens plane, $F$ is the focal length of the lens and $s$ is the 
depth map of the scene. We simultaneously collect a number $N$ of defocused and 
motion-blurred images $\{I_1, \ldots, I_N\}$ by changing the parameter $p = \{p_1, \ldots, p_N\}$. 
Notice that the parameters $p_i$ lead to different variances $\sigma_i^2(x)$, which affect the 
isotropic component of the diffusion tensor $D$, but not its anisotropic component 
$\frac{V_{1,2} V_{1,2}^T}{s^2(x)}$. As shown in section 2.3 we can represent an image $I_i$ by taking the 
solution $u_i$ of eq. (13) at time $t = T = 1/2$ with a diffusion tensor 
$D_i(x) = \sigma_i^2(x) I_d + \frac{V_{1,2} V_{1,2}^T}{s^2(x)}$, and with initial condition $u_i(y, 0) = r(y)$, $\forall i = 1 \ldots N$. 
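
The dependence of the defocus variance on depth and focus setting can be sketched as follows. The snippet assumes the thin-lens form $\sigma^2 = (d/2)^2 (1 - p(1/F - 1/s))^2$ and the thin-lens law $1/p = 1/F - 1/s_{focus}$ for the in-focus distance; both are stated here as assumptions. The focal length and the two focus depths are those quoted for the synthetic experiment in section 4.1, while the aperture in pixels is a purely hypothetical value.

```python
def lens_to_image_distance(F, focus_depth):
    """Thin-lens law 1/p = 1/F - 1/s_focus (assumed): distance p between
    lens and image plane at which depth `focus_depth` is in focus."""
    return 1.0 / (1.0 / F - 1.0 / focus_depth)


def defocus_variance(s, d, p, F):
    """sigma^2 = (d/2)^2 * (1 - p * (1/F - 1/s))^2, assumed thin-lens form."""
    return (d / 2.0) ** 2 * (1.0 - p * (1.0 / F - 1.0 / s)) ** 2


F = 0.012          # focal length in metres (section 4.1)
d = 600.0          # aperture in pixel units: hypothetical value
p_near = lens_to_image_distance(F, 0.52)   # focus at 0.52 m
p_far = lens_to_image_distance(F, 0.85)    # focus at 0.85 m

for s in (0.52, 0.685, 0.85):
    print(s, defocus_variance(s, d, p_near, F), defocus_variance(s, d, p_far, F))
```

Under these assumptions each focus setting gives zero blur at its own focus depth and increasing blur away from it, which is why two settings provide complementary depth information.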

We pose the problem of inferring the radiance $r$, the depth map $s$ and the 
motion field $v$ of the scene by minimizing the following least-squares functional 
with Tikhonov regularization (cf. [7]) 

$$
\hat{r}, \hat{s}, \hat{V}_{1,2} = \arg\min_{r, s, V_{1,2}} \sum_{i=1}^{N} \int_\Omega \left( u_i(x,T) - I_i(x) \right)^2 dx
+ \alpha \| r - r^* \|^2 + \beta \| \nabla s \|^2 + \gamma \left( \| V_{1,2} \| - M \right)^2,
\tag{18}
$$

where $\alpha$, $\beta$, and $\gamma$ are positive regularization parameters, $r^*$ is a prior³ for $r$ and 
$M$ is a suitable positive number⁴. One can choose the norm $\| \cdot \|$ depending on 
the desired space of solutions. We choose the $L^2$ norm for the radiance and the 
components of the gradient of the depth map and the $\ell^2$ norm for the velocity 
vector $V_{1,2}$. In this functional, the first term takes into account the discrepancy 
between the model and the measurements; the second and third terms are classical 
regularization functionals, penalizing large deviations of the radiance from the prior 
and imposing some regularity on the estimated depth map, respectively. The last term is of 
rather unusual form, its main objective being to exclude $V_{1,2} = 0$ as a stationary 
point. One easily checks that for $\gamma = 0$ or $M = 0$, $V_{1,2} = 0$ is always a stationary 
point of the functional in (18), which is of course an undesirable effect. This 
stationary point is removed for positive values of $M$ and $\gamma$ (see Figure 1). 

³ We do not have a preferred prior for the radiance $r$. However, it is necessary to 
introduce this term to guarantee that the estimated radiance does not diverge. In 
practice, one can use as a prior $r^*$ one of the input images, or a combination of them, 
and choose a very small $\alpha$. 

⁴ Intuitively, the constant $M$ is related to the maximum degree of motion-blur that 
we are willing to tolerate in the input data. 

Fig. 1. Left: cost functional for various values of $V_1$ and $V_2$ when $\gamma = 0$ or $M = 0$. 
Right: cost functional for various values of $V_1$ and $V_2$ when $\gamma \neq 0$ and $M \neq 0$. In both 
cases the cost functional eq. (18) is computed for a radiance $\hat{r}$ and a depth map $\hat{s}$ away 
from the true radiance $r$ and the true depth map $s$. Notice that on the right plot there 
are two symmetric minima for $V_{1,2}$. This is always the case unless the true velocity 
satisfies $V_{1,2} = 0$, since the true $V_{1,2}$ can be determined only up to the sign. 
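
The role of the last term in eq. (18) can be illustrated with a toy cost. Because the velocity enters the model only through the product $V_{1,2} V_{1,2}^T$, any data term is even in $V_{1,2}$ and therefore flat at the origin; adding $\gamma (\|V_{1,2}\| - M)^2$ with $\gamma, M > 0$ gives the cost a slope of $-2\gamma M$ in every direction there. In the sketch below the data term is a hypothetical stand-in (a squared Frobenius distance between outer products), not the functional of the paper, and all names and values are assumptions.

```python
import math

V_TRUE = (0.8, 0.0)   # illustrative 'true' velocity


def data_term(V):
    """Toy stand-in for the data term: it depends on V only through V V^T,
    as the tensor of eq. (15) does, so data_term(V) == data_term(-V)."""
    total = 0.0
    for i in range(2):
        for j in range(2):
            total += (V[i] * V[j] - V_TRUE[i] * V_TRUE[j]) ** 2
    return total


def cost(V, gamma, M):
    """Toy data term plus the penalty gamma * (||V|| - M)^2."""
    return data_term(V) + gamma * (math.hypot(V[0], V[1]) - M) ** 2


def slope_at_zero(direction, gamma, M, eps=1e-6):
    """One-sided directional derivative of the cost at V = 0."""
    step = (eps * direction[0], eps * direction[1])
    return (cost(step, gamma, M) - cost((0.0, 0.0), gamma, M)) / eps


for direction in ((1.0, 0.0), (0.0, 1.0), (-0.6, 0.8)):
    print(direction,
          slope_at_zero(direction, gamma=0.0, M=1.0),   # about 0: stationary
          slope_at_zero(direction, gamma=0.5, M=1.0))   # about -2 * gamma * M
```

With `M = 0` the penalty reduces to $\gamma \|V_{1,2}\|^2$, whose slope at the origin is again zero, which matches the statement that both $\gamma$ and $M$ must be positive.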



3.1 Cost Functional Minimization 



To minimize the cost functional (18), which we denote by $E$, we employ a gradient descent flow. For 
each unknown we compute a sequence converging to a local minimum of 
the cost functional, i.e. we have sequences $\hat{r}(x,\tau)$, $\hat{s}(x,\tau)$, $\hat{V}_{1,2}(\tau)$, such that 
$\hat{r}(x) = \lim_{\tau \to \infty} \hat{r}(x,\tau)$, $\hat{s}(x) = \lim_{\tau \to \infty} \hat{s}(x,\tau)$, $\hat{V}_{1,2} = \lim_{\tau \to \infty} \hat{V}_{1,2}(\tau)$. At each iteration 
we update the unknowns by moving in the opposite direction of the gradient 
of the cost functional with respect to the unknowns. In other words, we let 
$\partial \hat{r}(x,\tau) / \partial \tau = -\nabla_r E(x)$, $\partial \hat{s}(x,\tau) / \partial \tau = -\nabla_s E(x)$, $\partial \hat{V}_{1,2}(\tau) / \partial \tau = -\nabla_{V_{1,2}} E$. 
It can be shown that the above iterations decrease the cost functional as $\tau$ 
increases. The computation of the above gradients is rather involved, but yields 
the following formulas, that can be easily implemented numerically: 



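$$
\begin{aligned}
\nabla_r E &= \sum_{i=1}^{N} w_i(x, 0) \\
\nabla_s E &= 2 \sum_{i=1}^{N} \int_0^T [\ldots] \frac{V_{1,2} V_{1,2}^T}{s^3(x)} \nabla u_i(x,t) \cdot \nabla w_i(x,t) \, dt \\
\nabla_{V_{1,2}} E &= - \sum_{i=1}^{N} \int_0^T \!\! \int \frac{[\ldots]}{s^2(x)} \nabla u_i(x,t) \cdot \nabla w_i(x,t) \, dx \, dt
\end{aligned}
\tag{19}
$$

where $w_i$ satisfies the following adjoint parabolic equation (see [9] for more details): 

$$
\begin{cases}
\dot{w}_i(y,t) = -\nabla \cdot \left( D_i(y) \nabla w_i(y,t) \right) \\
w_i(y,T) = u_i(y,T) - I_i(y) \\
\left( D_i(y) \nabla w_i(y,t) \right) \cdot n = 0.
\end{cases}
\tag{20}
$$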
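
The adjoint construction can be illustrated on a discrete one-dimensional analogue: the gradient of the data term with respect to the initial condition is obtained by starting from the residual at the final time and propagating it back to time 0. The sketch below uses a constant-diffusivity explicit scheme whose update is symmetric, so the adjoint step coincides with the forward step; it keeps the factor 2 that comes from differentiating the squared residual. The scheme, the names and the data are assumptions for illustration and do not reproduce the paper's implementation.

```python
def diffusion_step(u, a):
    """One explicit step of u_t = d * u_xx (a = d * dt, unit grid, zero-flux ends)."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else u[i]
        right = u[i + 1] if i < n - 1 else u[i]
        out.append(u[i] + a * (left - 2.0 * u[i] + right))
    return out


def forward(r, a, steps):
    """Diffuse the initial condition r for `steps` steps."""
    u = list(r)
    for _ in range(steps):
        u = diffusion_step(u, a)
    return u


def cost(r, image, a, steps):
    """Discrete data term: sum of squared residuals at the final time."""
    u = forward(r, a, steps)
    return sum((ui - yi) ** 2 for ui, yi in zip(u, image))


def adjoint_gradient(r, image, a, steps):
    """Gradient of `cost` with respect to r via the adjoint recursion:
    start from the final-time residual and apply the (symmetric) step
    `steps` times; the factor 2 comes from the squared residual."""
    u = forward(r, a, steps)
    w = [ui - yi for ui, yi in zip(u, image)]
    for _ in range(steps):
        w = diffusion_step(w, a)
    return [2.0 * wi for wi in w]


# Hypothetical 1-D radiance and measured image.
r = [0.0, 1.0, 0.5, 2.0, 1.5, 0.0, 0.3, 0.9]
image = [0.2, 0.6, 0.9, 1.2, 1.0, 0.4, 0.4, 0.5]
a, steps = 0.2, 10
g = adjoint_gradient(r, image, a, steps)

# Central finite-difference check of one gradient component.
k, eps = 3, 1e-6
r_plus = r[:]
r_plus[k] += eps
r_minus = r[:]
r_minus[k] -= eps
fd = (cost(r_plus, image, a, steps) - cost(r_minus, image, a, steps)) / (2 * eps)
print(g[k], fd)
```

One descent step `r - 0.1 * g` lowers the cost in this example, which is the discrete counterpart of the statement that the flow decreases the functional.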



4 Experiments 

The algorithm presented in section 3.1 is tested on both synthetic (section 4.1) 
and real (section 4.2) data. In the first case, we compare the estimated unknowns 
with the ground truth and establish the performance of the algorithm for differ- 
ent amounts of noise. In the second case, since we do not have the ground truth, 
we only present a qualitative analysis of the results. We implement the gradient 
flow equations in section 3.1 with standard finite difference schemes (see [18,1]). 

4.1 Synthetic Data 

In this first set of experiments, we consider a scene made of a slanted plane (see 
the leftmost image in Figure 4), that has one side at 0.52m from the camera 
and the opposite side at 0.85m from the camera. The slanted plane is painted 
with a random texture. We define the radiance r to be the image measured on 
the image plane when a pinhole lens is used (see first image from the left in 
Figure 2). The second image from the left in Figure 2 has been captured when 
the scene or the camera are subject to a translational motion while the camera 
shutter remains open. Notice that the top portion of the image is subject to 
a more severe motion-blur than the bottom part. This is due to the fact that 
in this case points that are far from the camera (bottom portion of the image) 
move at a slower speed than points that are close to the camera (top portion of 
the image). 

We simulate a camera that has focal length 0.012m and F-number 2. With 
these settings we capture two images: one by focusing at 0.52m, and the other by 
focusing at 0.85m. If neither the camera nor the scene is moving, we capture 
the two rightmost images shown in Figure 2. If instead either the camera or 
the scene is moving sideways, we capture the two leftmost images shown in 
Figure 3. The latter two are the images we give as input to our algorithm. In 
Figure 3 we show the recovered radiance when no motion-blur is taken into 
account (third image from the left) and when motion-blur is taken into account 
(rightmost image). As one can notice by visual inspection, the latter estimate 
of the radiance is sharper than the estimate of the radiance when motion-blur 
is not modeled. The improvement in the estimation of the radiance can also be 
evaluated quantitatively since we have ground truth. To measure the accuracy 
of the estimated radiance, we compute the following normalized RMS error: 

$$ \mathrm{NRMSE}(\phi_{estimated}, \phi_{true}) = \frac{\| \phi_{estimated} - \phi_{true} \|}{\| \phi_{true} \|} \tag{21} $$

where $\phi_{estimated}$ is the estimated unknown, $\phi_{true}$ is the ground truth and $\| \cdot \|$ 
denotes the $L^2$ norm. We obtain that the NRMSE between the true radiance 
and the motion-blurred radiance (second image from the left in Figure 2) is 
0.2636. When we compensate only for defocus during the reconstruction, the 
NRMSE between the true radiance and the recovered radiance is 0.2642. As 
expected, since the motion-blurred radiance is the best estimate possible when 
we do not compensate for motion-blur, this estimated radiance cannot be more 
accurate than the motion-blurred radiance. Instead, when we compensate for 
both defocus and motion-blur, the NRMSE between the true radiance and the 
recovered radiance is 0.2321. This shows that the outlined algorithm can restore 
images that are not only defocused, but also motion-blurred. The recovered 
depth map is shown in Figure 4 on the two rightmost images together with the 
ground truth for direct comparison (left). The true motion is $V_{1,2} = [0.8 \ \ 0]^T$ and 
the recovered motion is $[0.8079 \ \ {-0.0713}]^T$ in focal length units. 

Fig. 2. First from the left: synthetically generated radiance. Second from the left: 
motion-blurred radiance. This image has been obtained by motion-blurring the 
synthetic radiance on the left. Third and fourth from the left: defocused images from a 
scene made of the synthetic radiance in Figure 2 (leftmost) and depth map in Figure 4 
(leftmost) without motion-blur. 

Fig. 3. First and second from the left: defocused and motion-blurred images from a 
scene made of the synthetic radiance in Figure 2 (leftmost) and depth map in Figure 4 
(leftmost). Third from the left: recovered radiance from the two defocused and 
motion-blurred images on the left when no motion-blur is taken into account ($V_{1,2} = 0$). Fourth 
from the left: recovered radiance from the two defocused and motion-blurred images 
on the left when motion blur is taken into account ($V_{1,2} \neq 0$). 

Fig. 4. Left: true depth map of the scene. Middle: recovered depth map. Right: profile 
of the recovered depth map. As can be noticed, the recovered depth map is very close 
to the true depth map with the exception of the top and bottom sides. This is due to 
the higher blurring that the images are subject to at these locations. 

Fig. 5. Left: depth map estimation for 5 levels of additive Gaussian noise. The plot 
shows the error bar for 20 trials with a staircase depth map and a random radiance. 
We compute the RMS error between the estimated depth and the true depth, and 
normalize it with respect to the norm of the true depth (see eq. (21)). Right: radiance 
estimation for 5 levels of additive Gaussian noise. As in the previous error bar, we 
compute the RMS error between the true radiance and the reconstructed radiance and 
then normalize it with respect to the norm of the true radiance. 

Fig. 6. First and second from the left: input images of the first data set. The two 
images are both defocused and motion-blurred. Motion-blur is caused by a sideway 
motion of the camera. Third from the left: recovered radiance. Fourth from the left: 
recovered depth map. 

To test the performance and the robustness of the algorithm, we synthetically 
generate defocused and motion-blurred images with additional Gaussian noise. 
We use a scene made of a staircase depth map with 20 steps, with the first step 
at 0.52m from the camera and the last step at 0.85m from the camera. As in 
the previous experiment, we capture two images: one by focusing at 0.52m and 
the other by focusing at 0.85m. To each of the images we add the following 5 
different amounts of Gaussian noise: 0%, 0.5%, 1%, 2.5% and 5% of the radiance 
magnitude. For each noise level we run 20 experiments from which we compute 
the mean and the standard deviation of the NRMSE. The results are shown in 
Figure 5. 
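
The error measure of eq. (21) and the noise protocol just described can be sketched as follows. The snippet only shows how the NRMSE statistics over 20 trials are gathered for each noise level; the reconstruction step itself is omitted, so the noisy signal stands in for the estimate. Interpreting "x% of the radiance magnitude" as a noise standard deviation of x% of the RMS value of the signal is an assumption, as are the function names and the random stand-in texture.

```python
import math
import random


def nrmse(estimated, true):
    """Normalised RMS error: ||estimated - true|| / ||true||."""
    num = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)))
    den = math.sqrt(sum(t * t for t in true))
    return num / den


def add_noise(signal, level, rng):
    """Add zero-mean Gaussian noise with standard deviation `level` times
    the RMS magnitude of the signal (assumed convention)."""
    rms = math.sqrt(sum(v * v for v in signal) / len(signal))
    return [v + rng.gauss(0.0, level * rms) for v in signal]


rng = random.Random(0)
radiance = [rng.random() for _ in range(1000)]   # stand-in random texture

for level in (0.0, 0.005, 0.01, 0.025, 0.05):
    errors = [nrmse(add_noise(radiance, level, rng), radiance) for _ in range(20)]
    mean = sum(errors) / len(errors)
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))
    print(f"noise {level:.3f}: mean NRMSE {mean:.4f}, std {std:.4f}")
```

In a full experiment the first argument of `nrmse` would be the radiance or depth map recovered by the algorithm from the noisy inputs.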

Fig. 7. First and second from the left: input images of the second data set. The two 
images are both defocused and motion-blurred. Motion-blur is caused by a sideway 
motion of the camera. Third from the left: recovered radiance. Fourth from the left: an 
image taken without motion-blur. 




Fig. 8. First from the left: estimated depth map visualized as a gray level intensity 
image. Second, third and fourth from the left: visualization of the estimated depth 
map from different viewing angles. The depth map is also texture mapped with the 
estimated radiance. 

4.2 Real Images 

We test the algorithm on two data sets. The first data set is made of the two real 
images shown in Figure 6. The scene is made of a box that is moving sideways. We 
simultaneously capture two images with a multifocal camera kindly lent to us by 
S. K. Nayar. The camera has an AF NIKKOR 35 mm Nikon lens, with F-number 
2.8. We capture the first image by focusing at 70 mm from the camera and the 
second image by focusing at 90 mm from the camera. The scene lies entirely 
between 70 mm and 90 mm. The estimated radiance is shown in Figure 6, together 
with the recovered depth map. The estimated motion is $V_{1,2} = [0.5603 \ \ 0.0101]^T$ 
in units of focal length. In the second data set we use the two defocused and 
motion-blurred images in Figure 7 (first and second image from the left) captured 
with the same camera settings as in the first data set. The scene, composed of 
a banana and a bagel, is moving sideways. The estimated radiance 
is shown in the third image from the left of the same figure. To visually compare 
the quality of the estimated radiance, we also add the fourth image from the 
left in Figure 7. This image has been obtained from about the same viewing 
point when neither the camera nor the scene was moving. Hence, this image is 
only subject to defocus. The reconstructed depth map is shown in Figure 8. The 
first image from the left is the depth map visualized as a gray level image. Light 
intensities correspond to points that are close to the camera and dark intensities 
correspond to points that are far from the camera. The next three images are 
visualizations of the depth map from different viewing angles with the estimated 
radiance texture mapped onto it. The estimated velocity for this data set is 
$V_{1,2} = [0.9639 \ \ {-0.0572}]^T$, which corresponds to a sideways motion. 

5 Summary and Conclusions 

In this manuscript we proposed a solution to the problem of inferring the depth, 
radiance and motion of a scene from a collection of motion-blurred and defocused 
images. First, we presented a novel model that can take into account both defocus 
and motion-blur (assuming motion is pure sideways translation), and showed that
it is well-posed. Motion-blurred and defocused images are represented as the 
solution of an anisotropic diffusion equation, whose initial conditions are defined 
by the radiance and whose diffusion tensor encodes the shape of the scene, the 
motion field and the optics parameters. Then, we proposed an efficient algorithm 
to infer the unknowns of the model. The algorithm is based on minimizing the 
discrepancy between the measured blurred images and the ones synthesized via 
diffusion. Since the inverse problem is ill-posed, we also introduced additional
Tikhonov regularization terms. The resulting method is fast and robust to noise,
as shown in the experimental section.
Abstract. To bridge the gap between low-level features and high-level
semantic queries in image retrieval, detecting meaningful visual entities
(e.g. faces, sky, foliage, buildings, etc.) based on trained pattern classifiers
has become an active research trend. However, a drawback of the super- 
vised learning approach is the human effort to provide labeled regions 
as training samples. In this paper, we propose a new three-stage hybrid 
framework to discover local semantic patterns and generate their samples 
for training with minimal human intervention. Support vector machines 
(SVM) are first trained on local image blocks from a small number of 
images labeled as several semantic categories. Then to bootstrap the lo- 
cal semantics, image blocks that produce high SVM outputs are grouped 
into Discovered Semantic Regions (DSRs) using fuzzy c-means cluster- 
ing. The training samples for these DSRs are automatically induced from 
cluster memberships and subject to support vector machine learning to 
form local semantic detectors for DSRs. An image is then indexed as 
a tessellation of DSR histograms and matched using histogram inter- 
section. We evaluate our method against the linear fusion of color and 
texture features using 16 semantic queries on 2400 heterogeneous con- 
sumer photos. The DSR models achieved a promising 26% improvement 
in average precision over that of the feature fusion approach. 



1 Introduction 

Content-based image retrieval research has progressed from the feature-based 
approach (e.g. [9]) to the region-based approach (e.g. [5]). In order to bridge the 
semantic gap [20] that exists between computed perceptual visual features and 
conceptual user query expectation, detecting semantic objects (e.g. faces, sky,
foliage, buildings, etc.) based on trained pattern classifiers has received serious
attention (e.g. [15,16,22]). However, a major drawback of the supervised learning
approach is the human effort required to provide labeled training samples,
especially at the image region level. Lately there are two promising trends that 
attempt to achieve semantic indexing of images with minimal or no effort of 
manual annotation (i.e. semi-supervised or unsupervised learning). 

In the field of computer vision, researchers have developed object recogni- 
tion systems from unlabeled and unsegmented images [8,19,25]. In the context 
of relevance feedback, unlabeled images have also been used to bootstrap the
learning from very limited labeled examples (e.g. [24,26]). For the purpose of 
image retrieval, unsupervised models based on “generic” texture-like descrip- 
tors without explicit object semantics can also be learned from images without 
manual extraction of objects or features [18]. As a representative of the
state-of-the-art, a sophisticated generative and probabilistic model has been
proposed to represent, learn, and detect object parts, locations, scales, and
appearances from fairly cluttered scenes with promising results [8].

Motivated from a machine translation perspective, object recognition is posed 
as a lexicon learning problem to translate image regions to corresponding words 
[7]. More generally, the joint distribution of meaningful text descriptions and 
entire or local image contents are learned from images or categories of images 
labeled with a few words [1,3,11,12]. The lexicon learning metaphor offers a new 
way of looking at object recognition [7] and a powerful means to annotate entire 
images with concepts evoked by what is visible in the image and specific words 
(e.g. fitness, holiday, Paris, etc. [12]). While the annotation results on entire
images look promising [12], the correspondence problem of associating words with
segmented image regions remains very challenging [3] as segmentation, feature 
selection, and shape representation are critical and non-trivial choices [2]. 

In this paper, we address the issue of minimal supervision differently. We do 
not assume availability of text descriptions for image or image classes as in [3, 
12]. Neither do we know the object classes to be recognized as in [8]. We wish 
to discover and associate local unsegmented regions with semantics and gen- 
erate their samples to construct models for content-based image retrieval, all 
with minimal manual intervention. This is realized as a novel three-stage hybrid
framework that interleaves supervised and unsupervised learning. First, support
vector machines (SVM) are trained on local image blocks from a small number
of images labeled as several semantic categories. Then to bootstrap the local se- 
mantics, typical image blocks that produce high SVM outputs are grouped into 
Discovered Semantic Regions (DSRs) using fuzzy c-means clustering. The train- 
ing samples for these DSRs are automatically induced from cluster memberships 
and subject to local support vector machine learning to form local semantic de- 
tectors for DSRs. An image is indexed as a tessellation of DSR histograms and 
matched using histogram intersection. 

We evaluate our method against the linear fusion of color and texture fea- 
tures using 16 semantic queries on 2400 heterogeneous consumer photos with 
many cluttered scenes. The DSR implementation achieved a promising 26% im- 
provement in average precision over that of the feature fusion approach. 

The rest of the paper is presented as follows. We explain our local seman- 
tics discovery framework followed by the mechanisms for image indexing and 
matching in the next two sections respectively. Then we report and compare the 
results on the query-by-example experiments. Last but not least, we discuss the 
relevant aspects of our approach with other promising works in unsupervised 
semantics learning and issues for future research. 
Fig. 1. A schematic diagram of local semantics discovery



2 Local Semantics Discovery 

Image categorization is a powerful divide-and-conquer metaphor to organize and 
access images. Once the images are sorted into semantic classes, searching and 
browsing can be carried out in a more effective and efficient way by focusing only
on relevant classes and subclasses. Moreover the classes provide context for other
tasks. For example, for medical images, the context could be the pathological 
classes for diagnostic purpose [4] or imaging modalities for visualization purpose 
[14]. In this paper, we propose a framework to discover the local semantics that 
distinguish image classes and use these Discovered Semantic Regions (DSRs) 
to span a semantic space for image indexing. Fig. 1 depicts the steps in the 
framework which can be divided into three learning phases as described below. 

2.1 Learning of Local Class Semantics 

Given a content or application domain, some distinctive classes Ck with their 
image samples are identified. For consumer images used in our experiments, a 
taxonomy as shown in Fig. 2 has been designed. This hierarchy of 11 categories 
is more comprehensive than the 8 categories addressed in [23]. We select the 
7 disjoint categories represented by the leaf nodes (except the miscellaneous 
category) in Fig. 2 and their samples to train 7 binary support vector machines 
(SVM). The training samples are tessellated image blocks z from the class
samples. After learning, the class models would have captured the local class
semantics and a high SVM output (i.e. $C_k(z) \gg 0$) would suggest that the local
region z is typical of the semantics of class k.

Fig. 2. Proposed hierarchy of consumer photo categories

In this paper, as our test data are heterogeneous consumer photos, we extract
color and texture features for a local image block and denote this feature vector
as z. Hence a feature vector z has two parts, namely, a color feature vector
$z^c$ and a texture feature vector $z^t$. For the color feature, as the image patch
for training and detection is relatively small, the mean and standard deviation
of each color channel are deemed sufficient (i.e. $z^c$ has 6 dimensions). We use
the YIQ color space rather than other color spaces (e.g. RGB, HSV, LUV) as it
performed better in our experiments. For the texture feature, we adopted the
Gabor coefficients, which have been shown to provide excellent pattern retrieval
results [13]. Similarly, the means and standard deviations of the Gabor
coefficients (5 scales and 6 orientations) in an image block are computed as
$z^t$, which has 60 dimensions. To normalize both the color and texture features,
we use the Gaussian (i.e. zero-mean) normalization.
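As a rough illustration of the color part of the feature vector, the following sketch computes the mean and standard deviation of the Y, I and Q channels of a block using only the Python standard library. The function name `color_feature`, the pixel format (RGB triples in [0, 1]) and the ordering of the six values are assumptions made for this example; the Gabor texture part and the Gaussian normalization are not shown.

```python
import colorsys
from statistics import mean, pstdev


def color_feature(block):
    """Return [mean_Y, std_Y, mean_I, std_I, mean_Q, std_Q] for one block.

    `block` is a list of (r, g, b) tuples with components in [0, 1]
    (an assumed input format, not taken from the paper).
    """
    yiq_pixels = [colorsys.rgb_to_yiq(r, g, b) for r, g, b in block]
    feature = []
    for channel in zip(*yiq_pixels):  # iterate over the Y, I and Q channels
        feature.append(mean(channel))
        feature.append(pstdev(channel))
    return feature


# Two toy blocks: a uniform patch and a black/white mixture.
flat_block = [(0.2, 0.4, 0.6)] * 16
mixed_block = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)] * 8

flat_feature = color_feature(flat_block)
mixed_feature = color_feature(mixed_block)
```

A uniform block yields zero standard deviation in every channel, which is one reason a small patch can be summarized adequately by these six numbers.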

The distance or similarity measure depends on the kernel adopted for the 
support vector machines. For the experimental results reported in this paper, we 
adopted polynomial kernels with the following modified dot product similarity 
measure between feature vectors y and z, 



$$ y \cdot z = \frac{1}{2} \left( \frac{y^c \cdot z^c}{|y^c|\,|z^c|} + \frac{y^t \cdot z^t}{|y^t|\,|z^t|} \right) \qquad (1) $$
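The modified dot product of Equation (1) averages the normalized dot products of the color and texture parts. A minimal sketch follows, assuming the color part occupies the first 6 entries of each vector and that a zero-length part contributes 0 (both are assumptions of this example). The polynomial kernel that wraps this similarity is not shown.

```python
import math


def _normalized_dot(u, v):
    """Dot product of u and v divided by the product of their norms."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero part
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def modified_dot(y, z, n_color=6):
    """Average of the normalized color and texture dot products."""
    y_color, y_texture = y[:n_color], y[n_color:]
    z_color, z_texture = z[:n_color], z[n_color:]
    return 0.5 * (_normalized_dot(y_color, z_color)
                  + _normalized_dot(y_texture, z_texture))


# Toy vectors: 6 color entries followed by 2 texture entries.
y = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
z = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
same = modified_dot(y, y)
half = modified_dot(y, z)
```

Normalizing each part separately keeps the 60-dimensional texture part from dominating the 6-dimensional color part.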



2.2 Learning of Typical Semantic Partitions 

With the help of the learned class models $C_k$, we can generate sets of local
image regions $X_k$ that characterize the class semantics (which in turn capture
the semantics of the content domain) as

$$ X_k = \{ z \mid C_k(z) > \rho \} \quad (\rho > 0) \qquad (2) $$

However, the local semantics hidden in each $X_k$ are opaque and possibly
multi-modal. We would like to discover the multiple groupings in each class by
unsupervised learning such as Gaussian mixture modeling and fuzzy c-means
clustering. The result of the clustering is a collection of partitions $m_{kj}$,
$j = 1, 2, \ldots, N_k$, in the space of local semantics for each class, where the
$m_{kj}$ are usually represented as cluster centers and the $N_k$ are the numbers of
partitions for each class.
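To make the clustering step concrete, here is a compact sketch of standard fuzzy c-means in pure Python. It is not the implementation used in the experiments: the fuzzifier value 2, the Euclidean distance, the fixed iteration count and the explicitly supplied initial centers are all assumptions chosen to keep the example short and deterministic.

```python
import math


def memberships_for(points, centers, fuzzifier=2.0):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    power = 2.0 / (fuzzifier - 1.0)
    rows = []
    for x in points:
        dists = [math.dist(x, c) for c in centers]
        if min(dists) == 0.0:
            # The point sits exactly on one or more centers: share it among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
        else:
            rows.append([1.0 / sum((d_i / d_k) ** power for d_k in dists)
                         for d_i in dists])
    return rows


def fuzzy_c_means(points, init_centers, fuzzifier=2.0, n_iter=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    centers = [tuple(c) for c in init_centers]
    dim = len(points[0])
    for _ in range(n_iter):
        rows = memberships_for(points, centers, fuzzifier)
        updated = []
        for j, old_center in enumerate(centers):
            weights = [row[j] ** fuzzifier for row in rows]
            total = sum(weights)
            if total == 0.0:
                updated.append(old_center)  # no point supports this cluster
                continue
            updated.append(tuple(
                sum(w * x[d] for w, x in zip(weights, points)) / total
                for d in range(dim)))
        centers = updated
    return centers, memberships_for(points, centers, fuzzifier)


# Toy "typical blocks" forming two well-separated groups in feature space.
blocks = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, memberships = fuzzy_c_means(blocks, init_centers=[blocks[0], blocks[-1]])
```

The returned centers play the role of the cluster centers $m_{kj}$, and the membership rows indicate how strongly each block belongs to each partition.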



2.3 Learning of Discovered Semantic Regions 

After obtaining the typical semantic partitions for each class, we can learn the
models of DSRs $S_i$, $i = 1, 2, \ldots, N$ where $N = \sum_k N_k$ (i.e. linearize the
$m_{kj}$ subscripts as $m_i$). We label a local image block $x \in \cup_k X_k$ as a
positive example for $S_i$ if it is closest to $m_i$ and as a negative example for
$S_j$, $j \neq i$,

$$ X_i^+ = \{ x \mid i = \arg\min_t |x - m_t| \} \qquad (3) $$

$$ X_i^- = \{ x \mid i \neq \arg\min_t |x - m_t| \} \qquad (4) $$

where $|\cdot|$ is some distance measure. Now we can perform supervised learning
again on $X_i^+$ and $X_i^-$ using, say, support vector machines $S_i(x)$ as DSR models.

To visualize a DSR $S_i$, we can display the image block $s_i$ that is most typical
among those assigned to cluster $m_i$ that belonged to class k,

$$ C_k(s_i) = \max_{x \in X_i^+} C_k(x) \qquad (5) $$
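The labeling rule of Equations (3) and (4) amounts to a nearest-center assignment. The sketch below assumes Euclidean distance for the unspecified distance measure; the function name `assign_examples` and the toy data are hypothetical.

```python
import math


def assign_examples(blocks, centers):
    """Split blocks into positive and negative sets for every DSR index i.

    A block is positive for the DSR whose center is nearest to it and
    negative for every other DSR.
    """
    positives = {i: [] for i in range(len(centers))}
    negatives = {i: [] for i in range(len(centers))}
    for x in blocks:
        nearest = min(range(len(centers)),
                      key=lambda i: math.dist(x, centers[i]))
        for i in range(len(centers)):
            target = positives if i == nearest else negatives
            target[i].append(x)
    return positives, negatives


blocks = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
positives, negatives = assign_examples(blocks, centers)
```

Under this rule every DSR model sees every block exactly once, either as a positive or as a negative example, so all models are trained on the same number of samples.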



3 Image Indexing and Matching 



Image indexing based on DSRs consists of three steps, namely detection,
reconciliation, and aggregation. Once the support vector machines $S_i$ have been
trained, the detection vector T of a local image block z can be computed via the
softmax function [6] as

$$ T_i(z) = \frac{\exp(S_i(z))}{\sum_j \exp(S_j(z))} \qquad (6) $$



As each binary SVM is regarded as an expert on a DSR, the output of $S_i$, $\forall i$,
is set to 0 if there exists some $S_j$, $j \neq i$, that has a positive output. That
is, $T_j$ is close to 1 and $T_i = 0$ $\forall i \neq j$.
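The softmax mapping from raw SVM outputs to a detection vector can be sketched as follows. Subtracting the maximum before exponentiating is a standard numerical safeguard added for this example, and the expert-style gating of outputs described in the text is deliberately left out.

```python
import math


def detection_vector(svm_outputs):
    """Softmax over the raw outputs of the DSR detectors for one block."""
    shift = max(svm_outputs)  # does not change the result, avoids overflow
    exps = [math.exp(s - shift) for s in svm_outputs]
    total = sum(exps)
    return [e / total for e in exps]


t = detection_vector([2.0, -1.0, 0.5])
uniform = detection_vector([0.0, 0.0])
```

The entries of the resulting vector are positive and sum to 1, so they can be accumulated like histogram counts in the later aggregation step.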

To detect DSRs with translation and scale invariance in an image, the image 
is scanned with multi-scale windows, following the strategy in view-based object 
detection [17]. In our experiments, we progressively increase the window size 
from 20 x 20 to 60 x 60 at a step of 10 pixels, on a 240 x 360 size-normalized 
image. That is, after this detection step, we have 5 maps of detection. 

To reconcile the detection maps across different resolutions onto a common 
basis, we adopt the following principle: If the detection value of the most con- 
fident class of a region at resolution r is less than that of a larger region (at 
resolution r + 1) that subsumes the region, then the detection vector of the re- 
gion should be replaced by that of the larger region at resolution r + 1. Using 
this principle, we start the reconciliation from the detection map based on the
largest scan window (60 x 60) and proceed to the detection map based on the
next-to-smallest scan window (30 x 30). After 4 cycles of reconciliation, the
detection map that is based on the
smallest scan window (20 x 20) would have consolidated the detection decisions 
obtained at other resolutions. 
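The reconciliation principle can be illustrated for a single chain of nested regions, one per window size. This is a simplified sketch: the data layout (a list of detection vectors ordered from the smallest to the largest window, each region contained in the next) and the function names are assumptions, and a full implementation would apply the same rule to every region of every detection map.

```python
def reconcile(small_vec, large_vec):
    """Keep the enclosing region's vector if its top detection is stronger."""
    if max(small_vec) < max(large_vec):
        return list(large_vec)
    return list(small_vec)


def reconcile_chain(vectors_by_scale):
    """Propagate decisions from the largest window down to the smallest.

    Returns the consolidated vector and the number of reconciliation cycles.
    """
    current = list(vectors_by_scale[-1])
    cycles = 0
    for vec in reversed(vectors_by_scale[:-1]):
        current = reconcile(vec, current)
        cycles += 1
    return current, cycles


window_sizes = list(range(20, 61, 10))  # 20, 30, 40, 50, 60

# Toy detection vectors for nested regions, smallest window first.
chain = [[0.6, 0.4], [0.5, 0.5], [0.9, 0.1], [0.7, 0.3], [0.55, 0.45]]
consolidated, cycles = reconcile_chain(chain)
```

With five window sizes the chain needs four comparisons, and in this toy case the confident detection made at the 40 x 40 window replaces the weaker ones at the 30 x 30 and 20 x 20 windows.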

Suppose a region Z comprises n small equal regions with feature vectors
$z_1, z_2, \ldots, z_n$ respectively. To account for the size of detected DSRs in the
area Z, the DSR detection vectors of the reconciled detection map are aggregated as

$$ T_i(Z) = \frac{1}{n} \sum_k T_i(z_k) \qquad (7) $$
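Aggregation over a region is a per-component average of the detection vectors of its sub-regions. A minimal sketch, with a hypothetical function name and toy one-hot vectors:

```python
def aggregate(detection_vectors):
    """Average the detection vectors of the n sub-regions of a region."""
    n = len(detection_vectors)
    return [sum(component) / n for component in zip(*detection_vectors)]


# Four sub-regions, three of which are dominated by the first DSR.
region = aggregate([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
```

The result behaves like a normalized histogram over DSRs: a DSR that covers three quarters of the region receives a value of 0.75.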
For query by example, the content-based similarity $\lambda$ between a query q and
an image x can be computed in terms of the similarity between their corresponding
local regions. For example, the similarity based on the $L_1$ distance measure (city
block distance) between query q with m local regions $Y_j$ and image x with m
local regions $Z_j$ is defined as

$$ \lambda(q, x) = 1 - \frac{1}{2m} \sum_j \sum_i |T_i(Y_j) - T_i(Z_j)| \qquad (8) $$

This is equivalent to histogram intersection [21] except that the bins have se- 
mantic interpretation. In general, we can attach different weights to the regions 
(i.e. Yj,Zj) to emphasize the focus of attention (e.g. center). In this paper, we 
report experimental results based on even weights as grid tessellation is used. 
Also we have attempted various similarity and distance measures (e.g. cosine
similarity, $L_2$ distance, Kullback-Leibler (KL) distance, etc.) and the simple city
block distance in Equation (8) has the best performance. When a query has
multiple examples, $Q = \{q_1, q_2, \ldots, q_K\}$, the similarity is computed as

$$ \lambda(Q, x) = \max_i \lambda(q_i, x) \qquad (9) $$
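The two similarity measures can be sketched together. The normalization by 2m is an assumption of this example, chosen so that the score lies in [0, 1] and coincides with histogram intersection when every regional vector sums to 1. Function names and data are hypothetical.

```python
def similarity(query_regions, image_regions):
    """City-block similarity between two images given per-region DSR vectors."""
    m = len(query_regions)
    l1 = sum(abs(a - b)
             for y, z in zip(query_regions, image_regions)
             for a, b in zip(y, z))
    return 1.0 - l1 / (2.0 * m)


def multi_query_similarity(queries, image_regions):
    """Best similarity over all example images of a query."""
    return max(similarity(q, image_regions) for q in queries)


image = [[0.5, 0.5], [1.0, 0.0]]
opposite = [[0.5, 0.5], [0.0, 1.0]]
identical_score = similarity(image, image)
disjoint_score = similarity([[1.0, 0.0]], [[0.0, 1.0]])
best_score = multi_query_similarity([opposite, image], image)
```

Taking the maximum over the example images means that a database image only needs to resemble one of the query examples to rank highly.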

4 Experimental Results 

In this paper, we evaluate our DSR-based image indexing approach on 2400 
genuine consumer photos, taken over 5 years in several countries with both 
indoor and outdoor settings. After removing possibly noisy margins, the images 
are size-normalized to 240 x 360. The indexing process automatically detects the 
layout and applies the corresponding tessellation template. In our experiments, 
the tessellation for detection of DSRs is a 4 x 4 grid of rectangular regions. Fig. 
3 displays typical photos in this collection. Photos of bad quality (e.g. faded, 
over-exposed, blurred, dark etc) (not shown here) are retained in order to reflect 
the complexity of the original data. 




Fig. 3. Sample consumer photos from the 2400 collection. They also represent 2 rele- 
vant images (top-down, left-right) for each of the 16 queries used in our experiments. 
Table 1. Training statistics of the semantic classes $C_k$ for bootstrapping local
semantics. The columns (left to right) list the class labels, the size of ground
truth, the number of training images, the number of support vectors learned, the
number of typical image blocks subject to clustering ($C_k(z) > 2$), and the number
of clusters assigned.

Class   G.T.   #trg   #sv    #data   #clus
inob    134    15     1905   1429    4
inpp    840    20     2249   936     5
mtrk    67     10     1090   1550    2
park    304    15     955    728     4
pool    52     10     1138   1357    2
strt    645    20     2424   735     5
wtsd    150    15     2454   732     4




Fig. 4. Most typical image blocks of the DSRs learned (left to right): china utensils 
and cupboard top (first four) for the inob class; faces with different background and 
body close-up (next five) for the inpp class; rocky textures (next two) for the mtrk 
class; green foliage and flowers (next four) for the park class; pool side and water (next 
two) for the pool class; roof top, building structures, and roadside (next five) for the 
strt class; and beach, river, pond, far mountain (next four) for the wtsd class. 



We trained 7 SVMs with polynomial kernels (degree 2, C = 100 [10]) for 
the leaf-node categories (except miscellaneous) on color and texture features 
(Equation (1)) of 60 x 60 image blocks (tessellated with 20 pixels in both
directions) from 105 sample images. Hence each SVM was trained on 16,800 image
blocks. After training, the samples from each class k are fed into classifier $C_k$
to test their typicality. Those samples with SVM output $C_k(z) > 2$ (Equation (2))
are subject to fuzzy c-means clustering. The number of clusters assigned to each
class is roughly proportional to the number of training images
in each class. Table 1 lists training statistics for these semantic classes: inob 
(indoor interior/objects), inpp (indoor people), mtrk (mountain/rocks), park 
(park/garden), pool (swimming pool), strt (street), and wtsd (waterside). We 
have 26 DSRs in total. 
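The counts quoted in this paragraph can be reproduced with a little arithmetic. The sketch assumes that blocks are taken from the 240 x 360 size-normalized images with a stride of 20 pixels and that partial blocks at the borders are not used; the per-class figures are copied from Table 1.

```python
def blocks_per_image(height, width, block=60, step=20):
    """Number of full block positions on a regular grid with the given stride."""
    rows = (height - block) // step + 1
    cols = (width - block) // step + 1
    return rows * cols


# Per-class values from Table 1 (training images, typical blocks, clusters).
training_images = [15, 20, 10, 15, 10, 20, 15]
typical_blocks = [1429, 936, 1550, 728, 1357, 735, 732]
clusters = [4, 5, 2, 4, 2, 5, 4]

per_image = blocks_per_image(240, 360)           # 10 rows x 16 columns
total_blocks = per_image * sum(training_images)  # blocks seen by each class SVM
total_dsrs = sum(clusters)
total_typical = sum(typical_blocks)
```

Under these assumptions each image contributes 160 blocks, which matches the 16,800 blocks over 105 images, and the cluster counts add up to the 26 DSRs.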

To build the DSR models, we trained 26 binary SVMs with polynomial kernels
(degree 2, C = 100 [10]), each on 7467 positive and negative examples (Equations 
(3) and (4)) (i.e. sum of column 5 of Table 1). To visualize the 26 DSRs that 
have been learned, we compute the most typical image block for each cluster 
(Equation (5)) and concatenate their appearances in Fig. 4. Image indexing was 
based on the steps as explained in Section 3. 
Table 2. Results of QBE experiments for 16 semantic queries (left to right): query
id, query description, size of ground truth, and average precisions based on random
retrieval (RAND), linear fusion of color and texture features (CTO), and discovered
semantic regions (DSR) (the indexing for the last two methods is based on a 4 x 4
grid).

Query   Description            G.T.   RAND   CTO    DSR
Q01     indoor                 994    0.41   0.62   0.79
Q02     outdoor                1218   0.51   0.78   0.78
Q03     people close-up        277    0.12   0.16   0.33
Q04     people indoor          840    0.35   0.59   0.76
Q05     interior or object     134    0.06   0.18   0.32
Q06     city scene             697    0.29   0.49   0.59
Q07     nature scene           521    0.22   0.35   0.46
Q08     at a swimming pool     52     0.02   0.18   0.62
Q09     street or roadside     645    0.27   0.50   0.53
Q10     along waterside        150    0.06   0.17   0.32
Q11     in a park or garden    304    0.13   0.71   0.51
Q12     at mountain area       67     0.03   0.28   0.31
Q13     buildings close-up     239    0.10   0.35   0.30
Q14     close up, indoor       73     0.03   0.15   0.30
Q15     small group, indoor    491    0.20   0.32   0.45
Q16     large group, indoor    45     0.02   0.29   0.29



We defined 16 semantic queries and their ground truths (G.T.) among the 
2400 photos (Table 2). In fact, Fig. 3 shows, in top-down left-to-right order, 2 rel- 
evant images for queries Q01-Q16 respectively. As we can see from these sample 
images, the relevant images for any query considered here exhibit highly var- 
ied and complex visual appearance. There is usually no dominant homogeneous 
color or texture region and they pose great difficulty for image segmentation. 
Hence, to represent each query, we selected 3 (i.e. K = 3 in Equation (9)) relevant
photos as query examples for Query By Example (QBE) experiments, since
a single query image is far from satisfactory for capturing the semantics of any
query; single query images indeed resulted in poor precision and recall in
our initial experiments. Precision and recall were computed without the
query images themselves in the lists of retrieved images.

In our experiments, we compare our local semantic discovery approach (de- 
noted as “DSR”) with the feature-based approach that combines color and texture
in a linearly optimal way (denoted as “CTO”). All indexing is carried out
with a 4 x 4 grid on the images.

Table 3. Comparison of average precisions at top numbers of retrieved images. The
last row compares the precisions averaged over all 16 queries. The last column shows
the relative improvement in percentage.

Avg.Prec.   CTO    DSR    %
At 20       0.64   0.71   10
At 30       0.59   0.68   15
At 50       0.52   0.63   21
At 100      0.46   0.57   24
overall     0.38   0.48   26

For the color-based signature, color histograms with $b^3$ ($b = 4, 5, \ldots, 17$)
bins in the RGB color space were computed on an image. The performance
peaked at 2197 ($b = 13$) bins with average precision (over all recall points)
$P_{avg} = 0.38$. Histogram intersection [21] was used to compare two color
histograms. For the texture-based signature, we adopted the means and standard
deviations of Gabor coefficients and the associated distance measure as reported
in [13]. The Gabor coefficients were computed with 5 scales and 6 orientations.
Convolution windows of 20 x 20, 30 x 30, ..., 60 x 60 were attempted. The best
performance was obtained when 20 x 20 windows were used with $P_{avg} = 0.24$. The
distance measures between a query and an image for the color and texture methods
were normalized within [0, 1] and combined linearly. Among the relative weights
attempted at 0.1 intervals, the best fusion was obtained at $P_{avg} = 0.38$ with a
dominant influence of 0.9 from the color feature.
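The baseline settings described here can be written down in a few lines. The sketch assumes that the relative weights range over 0.0, 0.1, ..., 1.0 (the text only states the 0.1 spacing) and that the two weights sum to 1; the function name `fuse` is hypothetical.

```python
def fuse(d_color, d_texture, w_color):
    """Linear fusion of two distances that are already normalized to [0, 1]."""
    if not 0.0 <= w_color <= 1.0:
        raise ValueError("w_color must lie in [0, 1]")
    return w_color * d_color + (1.0 - w_color) * d_texture


# Relative weights at 0.1 intervals (endpoints included by assumption).
candidate_weights = [k / 10 for k in range(11)]

# Histogram sizes b^3 for b = 4, ..., 17 bins per RGB channel.
color_bins = [b ** 3 for b in range(4, 18)]

example = fuse(0.2, 0.8, w_color=0.9)
```

With a color weight of 0.9 the fused distance stays close to the color distance, which is consistent with the fused average precision equalling that of color alone.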

As shown in Table 2, the DSR approach outperformed or matched the average
precisions of the CTO method in all queries except Q11 and Q13. The
random retrieval method (i.e. G.T./2400) (denoted as “RAND”) was used as a 
baseline comparison. In particular, the DSR approach more than doubled the 
performance of RAND and surpassed the average precisions of CTO by at least 
0.1 in more than half of the queries (Q03-08, Q10, Q14-15). Averaged over all 
queries, the DSR approach achieved a 26% improvement in precision over that 
of CTO (Table 3). As depicted in the same table, DSR is also consistently better 
than CTO in returning more relevant images at top numbers of images for prac- 
tical applications. As an illustration, Fig. 5 and Fig. 6 show the query examples 
and top 18 retrieved images for query Q08 respectively. All retrieved images 
except image 18 are considered relevant. 
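The comparisons made in this paragraph can be re-derived from the rounded figures in Table 2. The sketch below stores precisions in hundredths to avoid floating-point rounding surprises. Because it works from rounded table entries, the relative improvement it yields (about 25%) differs slightly from the reported 26%, which corresponds to the ratio of the rounded overall averages 0.48 and 0.38 in Table 3.

```python
# (query, ground truth size, RAND, CTO, DSR); precisions in hundredths.
TABLE2 = [
    ("Q01", 994, 41, 62, 79), ("Q02", 1218, 51, 78, 78),
    ("Q03", 277, 12, 16, 33), ("Q04", 840, 35, 59, 76),
    ("Q05", 134, 6, 18, 32), ("Q06", 697, 29, 49, 59),
    ("Q07", 521, 22, 35, 46), ("Q08", 52, 2, 18, 62),
    ("Q09", 645, 27, 50, 53), ("Q10", 150, 6, 17, 32),
    ("Q11", 304, 13, 71, 51), ("Q12", 67, 3, 28, 31),
    ("Q13", 239, 10, 35, 30), ("Q14", 73, 3, 15, 30),
    ("Q15", 491, 20, 32, 45), ("Q16", 45, 2, 29, 29),
]
COLLECTION_SIZE = 2400

# Queries where DSR falls below CTO.
worse = [q for q, _, _, cto, dsr in TABLE2 if dsr < cto]

# Queries where DSR more than doubles RAND and beats CTO by at least 0.1.
strong = [q for q, _, rand, cto, dsr in TABLE2
          if dsr > 2 * rand and dsr - cto >= 10]

# RAND is the ground truth size divided by the collection size.
rand_check = all(round(100 * gt / COLLECTION_SIZE) == rand
                 for _, gt, rand, _, _ in TABLE2)

mean_cto = sum(row[3] for row in TABLE2) / (100 * len(TABLE2))
mean_dsr = sum(row[4] for row in TABLE2) / (100 * len(TABLE2))
relative_gain = (mean_dsr - mean_cto) / mean_cto
```

The nine queries selected by `strong` are exactly those listed in the text, and `worse` contains only Q11 and Q13.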




Fig. 5. Query images for Q08. 




Fig. 6. Top 18 retrieved images by DSR for query Q08. 
5 Discussion

For the current implementation of our DSR approach, there are still several issues 
to be addressed. We could improve the sampling of image blocks for semantic class
learning by randomly selecting, say, 20% of the ground truth images in each class
as positive samples (and as negative samples for all other classes) as well as
by tessellating image blocks with different sizes (e.g. 20 x 20, 30 x 30, etc.) and
displacements (e.g. 10 pixels) to generate a more complete and denser coverage
of the local semantic space. However, these attempts turned out to be too ambitious
for practical training.

Another doubt is the usefulness of the semantic class learning in the first
place. Can we perform clustering of image blocks in each class directly (i.e.
without worrying about $C_k(z) > \rho$)? The result was indeed inferior (with average
precision of 0.39) for the QBE experiments. Hence the typicality criterion is
important to pick up the relevant hidden local semantics for discovery.

Cluster validity is a tricky issue. We have tried a fixed number of clusters
(e.g. 3, 4, 5, 7) and retained large clusters as DSRs. Alternatively we relied on
human inspection to select perceptually distinctive clusters (as visualized using
Equation (5)) as DSRs. However, the current way of assigning a number of clusters
roughly proportional to the number of training images has produced the best
performance in our experiments. In the future, we will explore other ways to model
DSRs (e.g. Gaussian mixture) and to determine the value of $\rho$. We would also
like to verify our approach on other content domains such as art images, medical
images, etc. to see if the DSRs make sense to the domain experts.

Although our attempt to alleviate the supervised learning requirement of la- 
beled images and regions differs from the current trends of unsupervised object 
recognition and matching words with pictures, the methods do share some com- 
mon techniques. For instance, similar to those of Schmid [18] and Fergus et al. 
[8], our approach computes local region features based on tessellation instead of
segmentation, though [8] used an interest detector and kept the number of
features below 30 for practical implementation. While Schmid focused on
“Gabor-like” features [18] and Fergus et al. worked on monochrome information only
[8], we have incorporated both color and texture information. As the clusters in 
[18] were generated by unsupervised learning only, they may not correspond to 
well-perceived semantics when compared to our DSRs. As we are dealing with 
cluttered and heterogeneous scenes, we did not model object parts as in the 
comprehensive case of [8]. On the other hand, we handle scale invariance with 
multi-scale detection and reconciliation of DSRs during image indexing. Last 
but not least, while the generative and probabilistic approaches [8,12] may enjoy 
modularity and scalability in learning, they do not exploit inter-class discrimi- 
nation to compute features unique to classes as in our case. 

For the purpose of image retrieval, the image signatures based on DSRs
realize semantic abstraction via prior learning and detection of visual classes
when compared to direct indexing based on low-level features. The compact
representation that accommodates imperfection and uncertainty in detection also
resulted in better performance than the fusion of very high-dimensional color
and texture features in our query-by-example experiments. Hence we feel that
the computational resources devoted to prior learning of DSRs and their
detection during indexing are a good trade-off for concise semantic representation and
effective retrieval performance. Moreover, the small footprint of DSR signatures 
has an added advantage in storage space and retrieval efficiency. 

6 Conclusion 

In this paper, we have presented a hybrid framework that interleaves supervised 
and unsupervised learning to discover local semantic regions without image seg- 
mentation and with minimal human effort. The discovered semantic regions serve 
as new semantic axes for image indexing and matching. Experimental
query-by-example results on 2400 genuine consumer photos with cluttered scenes have
shown that image indexes based on the discovered local semantics are more
compact and effective than a linear combination of color and texture features.
Abstract. An approach to recognizing hand gestures from a monocular
temporal sequence of images is presented. Of particular concern is the
representation and recognition of hand movements that are used in
single-handed American Sign Language (ASL). The approach exploits previous
linguistic analysis of manual languages that decomposes dynamic gestures
into their static and dynamic components. The first level of decomposition
is in terms of three sets of primitives: hand shape, location and
movement. Further levels of decomposition involve the lexical and sentence
levels and are part of our plan for future work. We propose and
demonstrate that given a monocular gesture sequence, kinematic fea- 
tures can be recovered from the apparent motion that provide distinctive 
signatures for 14 primitive movements of ASL. The approach has been 
implemented in software and evaluated on a database of 592 gesture se- 
quences with an overall recognition rate of 86.00% for fully automated 
processing and 97.13% for manually initialized processing. 



1 Introduction 

Automated gesture recognition has the potential to create powerful
human-computer interfaces. Computer vision provides methods to acquire and
interpret gesture information while being minimally obtrusive to the participant. 
To be useful, methods must be accurate in recognition with rapid execution to 
support natural interaction. Further, scalability to encompass the large range 
of human gestures is important. The current paper presents an approach to 
recognizing human gestures that leverages both linguistic theory and computer 
vision methods. Following a path taken in the speech recognition community for 
the interpretation of speech [22], we appeal to linguistics to define a finite set 
of contrastive primitives, termed phonemes, that can be combined to represent 
an arbitrary number of gestures. This ensures that the developed approach is 
scalable. Currently, we are focused on the representation and recovery of the 
movement primitives derived from American Sign Language (ASL). This same 
linguistics analysis has also been applied to other hand gesture languages (e.g. 
French Sign Language). To effect the recovery of these primitives, we make use
of robust, parametric motion estimation techniques to extract signatures that
uniquely identify each movement from a monocular input video sequence. Here, 
it is interesting to note that human observers are capable of recovering the 
primitive movements of ASL based on motion information alone [21]. For our 
case, empirical evaluation suggests that algorithmic instantiation of these ideas 
has sufficient accuracy to distinguish the target set of ASL movement primitives, 
with modest processing power. 



1.1 Related Research 

Significant effort in computer vision has been marshalled in the investigation 
of human gesture recognition (see [1,20] for general reviews); some examples 
follow. State-space models have been used to capture the sequential nature of 
gestures by requiring that a series of states estimated from visual data
match, in sequence, a learned model of ordered states [7]. This general approach
also has been used in conjunction with parametric curvilinear models of motion 
trajectories [6]. An alternative approach has used statistical factored sampling 
in conjunction with a model of parameterized gestures for recognition [5]; this 
approach can be seen as an application and extension of the CONDENSATION 
approach to visual tracking [14]. Further, several approaches have used Hidden 
Markov Models (HMMs) [17,24,26], neural networks [10] or time-delay neural 
networks [31] to learn from training examples (e.g., based on 2D or 3D features 
extracted from raw data) and subsequently recognize gestures in novel input. 

A number of the cited approaches have achieved interesting recognition rates, 
albeit often with limited vocabularies. Interestingly, many of these approaches 
analyze gestures without breaking them into their constituent primitives, which 
could be used as in our approach, to represent a large vocabulary from a small 
set of generative elements. Instead, gestures are dealt with as wholes, with pa- 
rameters learned from training sets. This tack may limit the ability of such 
approaches to generalize to large vocabularies as the training task becomes inor- 
dinately difficult. Additionally, several of these approaches make use of special 
purpose devices (e.g., coloured markers, data gloves) to assist in data acquisition. 

In [2,28], two of the earliest efforts to use linguistic concepts for the 
description and recognition of both general and domain-specific motion are 
presented. Recently, at least two lines of investigation have appealed to 
linguistic theory as an attack on issues in scaling gesture recognition to 
sizable vocabularies [18,30]. In [18] the authors use data glove output as the 
input to their system. Each phoneme, from the parameters shape, location, 
orientation and movement, is modelled by an HMM based on features extracted 
from the input stream, with an 80.4% sentence accuracy rate. In [30], to effect 
recovery, 3D motion is extracted from the scene by fitting a 3D model of an arm 
with the aid of three cameras in an orthogonal configuration (or a magnetic 
tracking system). The motion is then fed into parallel HMMs representing the 
individual phonemes. The authors report that by modelling gestures by phonemes, 
the word recognition rate was not severely diminished: 91.19% word accuracy 
with phonemes versus 91.82% word accuracy using word-level modelling. The 
results thus lend credence to modelling words by phonemes in vision-based 
gesture recognition. 

1.2 Contributions 

The main contributions of the present research are as follows. First, our approach 
models gestures in terms of their phonemic elements to yield an algorithm that 
recognizes gesture movement primitives given data captured with a single video 
camera. Second, our approach uses the apparent motion of an unmarked hand 
as input as opposed to fitting a model of a hand (arm) or using a mechanical 
device (e.g. data glove, magnetic tracker). Third, our recognition scheme is based 
on a nearest neighbour match to prototype signatures, where each of 14 move- 
ment primitives of ASL is found to have a distinctive prototype signature in a 
kinematic feature space. We have evaluated our approach empirically with 592 
video sequences and find an 86.00% phoneme accuracy rate for fully automated 
processing and 97.13% for manually initialized processing even as other aspects 
of the gesture (hand shape and location) vary. 

1.3 Outline of Paper 

This paper is subdivided into four main sections. This first section has provided 
motivation for modelling gestures at the phoneme level. Section 2 describes the 
linguistic basis of our representation as well as the algorithmic aspects of the 
approach. Section 3 documents empirical evaluation of our algorithm instantiation. 
Finally, Section 4 provides a summary. 

2 Technical Approach 

Our approach to gesture recognition centres around two main ideas. First, lin- 
guistic theory can be used to define a representational substrate that system- 
atically decomposes complex gestures into primitive components. Second, it is 
desirable to recover the primitives from data that is acquired with a standard 
video camera and minimal constraints on the user. Currently, we are focused on 
the recovery of the linguistically defined rigid single handed movement primi- 
tives of American Sign Language (ASL). The input is a temporal sequence of 
images that depicts a single movement phoneme. The output of our system is a 
classification of the depicted gesture as arising from one of the primitive move- 
ments, irrespective of other considerations (e.g., irrespective of hand location 
and shape). The location of the hand in the initial frame is obtained through 
an automated localization process utilizing the conjunction of temporal change 
and skin colour. We assume that the hand is the dominant moving object in the 
imaged scene as an aid to localization. To effect the recognition, a robust, affine 
motion estimator is applied to regions of interest defined by skin colour and 
temporal change on a frame-to-frame basis. The resulting time series of affine 
parameters are individually accumulated across the sequence to yield a signature 
that is used for classification of the depicted gesture. Details of the movement 
gesture vocabulary and the processing stages are presented next. 

Fig. 1. Stokoe’s phonemic analysis of ASL. The left panel (A) depicts the signing space 
where the locations reside. Shaded regions indicate locations used in our experiments. 
The upper right panel (B) depicts possible hand shapes. Circled shapes indicate shapes 
used in our experiments. The lower right panel (C) depicts possible single-handed 
movements: (a) upward (b) downward (c) rightward (d) leftward (e) toward signer (f) 
away signer (g) nod (h) supinate (i) pronate (j) up and down (k) side to side (l) twist 
wrist (m) circular (n) to and fro. The solid ellipse, dashed ellipse and dashed arrow 
represent the initial hand location, the final location and the path taken, respectively. 
We investigate the recognition of movement independent of location and shape. 



2.1 Linguistics Basis 

Prior to William Stokoe’s seminal work in ASL [27], it was assumed by linguists 
that the sign was the basic unit of ASL. Stokoe redefined the basic unit of a 
sign to units analogous to speech phonemes: minimally contrastive patterns that 
distinguish the symbolic vocabulary of a language. Stokoe’s system consists of 
three parameters that are executed simultaneously to define a gesture (see 
Fig. 1). The three parameters capture location, hand shape and movement. There are 
12 elemental locations defined by Stokoe residing in a volume in front of the 
signer termed the “signing space”. The signing space is defined as extending 
from just above the head to the hip area in the vertical axis and extending close 
to the extents of the signer’s body in the horizontal axis (see Fig. 1A). There are 
19 possible hand shapes (see Fig. 1B). While Stokoe’s complete vocabulary of 
movements consists of 24 primitives (i.e., single and two-handed movements), as a 
starting point, we restrict consideration to the 14 rigid single-handed movements 
shown in Fig. 1C. Current ASL theories still recognize the Stokoe system’s basic 
parameters but differ in their definition of the constituent elements of the 
parameters [29]. We use Stokoe’s definition of the parameters since they are 
generally agreed to represent an important approximation to the somewhat wider 
and finer grained space that might be required to capture all the subtleties of 
hand gesture languages. 

2.2 Motion Estimation 

Let $I(\mathbf{x},t)$ represent the image brightness at position $\mathbf{x} = (x,y)^T$ and time $t$. 
Using the brightness constancy constraint [12], we define the inter-frame motion, 
$\mathbf{u}(\mathbf{x}) = (u(\mathbf{x}), v(\mathbf{x}))^T$, as 

$$I(\mathbf{x}, t+1) = I(\mathbf{x} - \mathbf{u}(\mathbf{x}), t) \tag{1}$$

We employ an affine model to describe the motion, 

$$u(x,y) = a_0 + a_1 x + a_2 y, \qquad v(x,y) = a_3 + a_4 x + a_5 y \tag{2}$$

We make use of the affine model for two main reasons. First, through an ana- 
lytic derivation we found that there exists a unique mapping between Stokoe’s 
qualitative description of the movement of the hand in the world and the first- 
order kinematic decomposition of the corresponding visual motion fields. The 
first-order kinematic description includes the following measures: (differential) 
translation, rotation, isotropic expansion/contraction and shear. Cases (shown 
in Fig. 1C) a-d, j, k and m are characterized by translation (for m, horizontal and 
vertical translation oscillate out of phase; see Fig. 2); cases h, i and l involve 
rotation; cases e, f and n are characterized by expansion/contraction; case g involves 
shear and contraction. Due to space considerations the derivation has been 
omitted; for details see [8]. Second, over the small angular extent that encompasses 
the hand at comfortable signing distances from a camera, small movements can 
be approximated with an affine model. To effect the recovery of the affine 
parameters we make use of a robust, hierarchical, gradient-based motion estimator [4] 
operating over a Gaussian pyramid [15]. The hierarchical nature of the estimator 
allows us to handle image displacements of significant magnitude with 
computational efficiency while avoiding local minima. This estimator is applied to 
skin colour defined regions of interest in the pair of images under consideration. 
We use skin colour to restrict consideration to image data that arises from the 
hand; such regions are extracted using a Bayesian maximum-likelihood classifier 
[32]. As a further level of robustness we restrict consideration to points that 
experience a significant change in intensity (i.e., $\partial I/\partial t$). For robustness in 
motion estimation, we make use of an M-estimator [13] (as opposed to a more 
standard least-squares approach, cf. [3]) to allow for operation in the presence 
of outlying data in the form of non-hand pixels due to skin-colour 
oversegmentation, pixels that grossly violate the affine approximation, and points that 
violate brightness constancy. The particular error norm we choose is the 
Geman-McClure [13]. 
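
To make the robust affine fit concrete, the following is a minimal sketch of a single resolution level, under assumptions that are not taken from the paper: the derivatives `Ix`, `Iy`, `It` are assumed to be already available at the pixels of interest, the Geman-McClure fit is assumed to be carried out by iteratively reweighted least squares (IRLS), and the scale `sigma`, the iteration count and all function names are illustrative. The coarse-to-fine pyramid and the skin/change masking are omitted.

```python
import numpy as np

def geman_mcclure_weights(r, sigma):
    """IRLS weights psi(r)/r for the norm rho(r) = r^2 / (sigma^2 + r^2)."""
    return 2.0 * sigma**2 / (sigma**2 + r**2) ** 2

def estimate_affine_irls(Ix, Iy, It, x, y, sigma=1.0, n_iter=20):
    """Fit a = (a0..a5) of the affine model (2) to the linearized
    brightness constancy constraint Ix*u + Iy*v + It = 0.
    sigma and n_iter are illustrative choices, not values from the paper."""
    # Column k holds the coefficient multiplying a_k in Ix*u + Iy*v.
    A = np.stack([Ix, Ix * x, Ix * y, Iy, Iy * x, Iy * y], axis=1)
    b = -It
    a, *_ = np.linalg.lstsq(A, b, rcond=None)  # ordinary least-squares start
    for _ in range(n_iter):
        r = A @ a - b                          # per-pixel constraint residuals
        w = geman_mcclure_weights(r, sigma)    # outliers get weights near zero
        Aw = A * w[:, None]
        a = np.linalg.solve(A.T @ Aw, Aw.T @ b)  # weighted normal equations
    return a

def synthetic_demo(seed=0, n=2000):
    """Synthetic check: exact affine motion with 10% of pixels corrupted."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-50, 50, n)
    y = rng.uniform(-50, 50, n)
    Ix = rng.normal(size=n)
    Iy = rng.normal(size=n)
    a_true = np.array([1.5, 0.01, -0.02, -0.5, 0.02, 0.01])
    u = a_true[0] + a_true[1] * x + a_true[2] * y
    v = a_true[3] + a_true[4] * x + a_true[5] * y
    It = -(Ix * u + Iy * v)
    outliers = rng.random(n) < 0.1
    It[outliers] += rng.uniform(-20, 20, outliers.sum())
    return a_true, estimate_affine_irls(Ix, Iy, It, x, y)
```

Calling `synthetic_demo()` returns the true and estimated parameter vectors for comparison. The redescending weights are what allow non-hand pixels and brightness constancy violations to be ignored once the fit is close to the dominant motion.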

The motion estimator is applied to adjacent frames across an image sequence. 
As an initial seed, the hand region in the first frame of the sequence is outlined 
by an automated process that consists of utilizing the conjunction of skin colour 
detection and change detection (i.e., $\partial I/\partial t$) to define a map of likely regions where 
the hand may reside, followed by a morphologically-based shape analysis [15] for 
the hand itself that seeks the region within the skin/change map containing the 
maximum circular area. No manual intervention is present. Upon recovering the 
motion between the first pair of frames, the analysis window is moved based on 
the affine parameters found (initialized identically to zero at the first frame), the 
affine parameters are used as the initial parameters for the motion estimation 
of the next pair of images, and the motion estimation process is repeated. When 
the motion estimator reaches the end of the image sequence, six time series, each 
representing an affine parameter over the length of the sequence, are realized. 
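
The seeding step can be illustrated with a rough sketch. It assumes binary skin and change masks are already available, and it substitutes repeated erosion with a cross-shaped structuring element for the shape analysis of [15], so the "thickest" region it finds is diamond-like rather than truly circular. The function names are hypothetical.

```python
import numpy as np

def erode4(mask):
    """Binary erosion with a 4-connected (cross-shaped) structuring element."""
    p = np.pad(mask, 1, constant_values=False)
    return (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
            & p[1:-1, :-2] & p[1:-1, 2:])

def seed_hand_region(skin_mask, change_mask):
    """Return (row, col, n_erosions) for the thickest blob in the
    conjunction of the two masks, or None if the conjunction is empty.
    This is an illustrative stand-in, not the authors' procedure."""
    candidate = skin_mask & change_mask
    if not candidate.any():
        return None
    n = 0
    current = candidate
    while current.any():
        last = current              # remember the last non-empty erosion
        current = erode4(current)
        n += 1
    rows, cols = np.nonzero(last)   # surviving pixels mark the thickest blob
    return float(rows.mean()), float(cols.mean()), n
```

The returned erosion count gives a crude radius for the initial analysis window. A head that moves more than the hand would still win under this scheme, which mirrors the failure mode reported in Section 3.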

2.3 Kinematic Features 

Owing to their descriptive power in the current context, it is advantageous to 
rewrite the affine parameters in terms of kinematic quantities corresponding to 
horizontal and vertical translation, divergence, curl and deformation (see, e.g., 
[16]). In particular, from the coefficients in the affine transformation (2) we 
calculate the following time series, 

$$\begin{aligned}
\mathrm{hor}(t) &= a_0(t) \\
\mathrm{ver}(t) &= a_3(t) \\
\mathrm{div}(t) &= a_1(t) + a_5(t) \\
\mathrm{curl}(t) &= -a_2(t) + a_4(t) \\
\mathrm{def}(t) &= \sqrt{(a_1(t) - a_5(t))^2 + (a_2(t) + a_4(t))^2}
\end{aligned} \tag{3}$$

Each of the kinematic time series (3) has an associated unit of measurement 
(e.g., horizontal/vertical motion are in pixel units) that may differ from the 
others. To facilitate comparisons across the time series for the purposes of 
recognition, a rescaling of responses is appropriate. We make use of min-max rescaling 
[11], defined as 

$$z' = \frac{z - \mathrm{min}_1}{\mathrm{max}_1 - \mathrm{min}_1}\,(\mathrm{max}_2 - \mathrm{min}_2) + \mathrm{min}_2 \tag{4}$$

with $\mathrm{min}_1$ and $\mathrm{max}_1$ the minimum and maximum values (resp.) in the input 
data $z$, taken over the entire population sample, and $\mathrm{min}_2$ and $\mathrm{max}_2$ 
specifying the range of the rescaled data. For scaling ranges, we select $[-1,1]$ for elements 
of (3) that range symmetrically about the origin and $[0,1]$ for those with 
one-sided responses, i.e., $\mathrm{def}$. 

To complete the definition of our kinematic feature set, we accumulate 
parameter values across each of the five rescaled kinematic time series, $\mathrm{hor}(t)$, 
$\mathrm{ver}(t)$, $\mathrm{div}(t)$, $\mathrm{curl}(t)$, $\mathrm{def}(t)$, and express each resulting value as a proportion. 
The accumulation procedure is motivated by the observation that there are two 
fundamentally different kinds of movements in the vocabulary defined in Fig. 1: 
those that entail constant sign movements, i.e., movements (a-i), which are 
unidirectional; and those that entail periodic motions, i.e., movements (j-n), which move 
“back and forth”. To distinguish these differences, we accumulate our parameter 
values in two fashions. 
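
As a brief aside before the two accumulations are defined, the per-frame kinematic quantities and the rescaling step can be sketched as follows. The kinematic formulas follow directly from the affine model (2). The rescaling function assumes the standard min-max form; its argument names are illustrative.

```python
import math

def kinematic_features(a):
    """Map affine parameters (a0..a5) of model (2) to kinematic quantities."""
    a0, a1, a2, a3, a4, a5 = a
    return {
        "hor": a0,                          # horizontal translation
        "ver": a3,                          # vertical translation
        "div": a1 + a5,                     # isotropic expansion/contraction
        "curl": -a2 + a4,                   # rotation about the optical axis
        "def": math.hypot(a1 - a5, a2 + a4),  # shear magnitude
    }

def min_max_rescale(z, min1, max1, min2, max2):
    """Standard min-max rescaling of z from [min1, max1] to [min2, max2].
    min1/max1 are assumed to come from the whole sample set."""
    return (z - min1) / (max1 - min1) * (max2 - min2) + min2
```

For example, a pure image-plane rotation (`a2 = -w`, `a4 = w`) yields a curl of `2w` with zero divergence and deformation, which is why the supinate and pronate movements show up in `curl` alone.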

First, to distinguish constant sign movements, we compute a summed 
response, $SR_i$, 

$$SR_i = \sum_{t=1}^{T} p_{i,t}$$

where $i \in \{\mathrm{hor}, \mathrm{ver}, \mathrm{div}, \mathrm{curl}, \mathrm{def}\}$ indexes a time series, $T$ represents the 
number of frames a gesture spans and $p_{i,t}$ represents the value of (rescaled) time 
series $i$ at time $t$. Constant sign movements should yield non-zero magnitude 
$SR_i$ for some $i$, whereas periodic movements will not, as their changing sign 
responses will tend to cancel across time. 

Second, to distinguish periodic movements, we compute a summed absolute 
response, $SAR_i$, 

$$SAR_i = \sum_{t=1}^{T} |\hat{p}_{i,t}|, \quad \text{where } \hat{p}_{i,t} = p_{i,t} - \mathrm{mean}_i$$

where $\mathrm{mean}_i$ represents the mean value of (rescaled) time series $i$. Now, constant 
sign movements will have relatively small $SAR_i$ for all $i$ (given removal of the 
mean, assuming a relatively constant velocity), whereas periodic movements 
will have significantly non-zero responses, as the subtracted mean should be near 
zero (assuming approximate symmetry in the underlying periodic pattern) and 
the absolute responses now sum to a positive quantity. 
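
The two accumulations are simple enough to sketch directly. The function names below are illustrative, and each function takes one rescaled time series as a plain list.

```python
import math

def summed_response(p):
    """SR_i: signed sum of a rescaled time series."""
    return sum(p)

def summed_absolute_response(p):
    """SAR_i: sum of absolute deviations from the series mean."""
    mean = sum(p) / len(p)
    return sum(abs(v - mean) for v in p)

# A constant-velocity series and a periodic series behave in opposite ways.
constant = [0.5] * 40
periodic = [math.sin(2 * math.pi * t / 40) for t in range(80)]
```

The constant series has a large summed response and zero summed absolute response, while the periodic series (two full cycles) has a summed response near zero and a large summed absolute response.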

Due to the min-max rescaling (4), the $SR_i$ and $SAR_i$ calculated for any 
given gesture sequence are expressed in comparable ranges on an absolute scale 
established from consideration of all available data (i.e., $\mathrm{min}_1$ and $\mathrm{max}_1$ are set 
based on scanning across the entire sample set). For the evaluation of any given 
gesture sequence, we need to represent the amount of each kinematic quantity 
observed relative to the others in that particular sequence. For example, a (e.g., 
very slow) vertical motion in the absence of any other motion should be taken 
as significant irrespective of the speed. To capture this notion, we convert the 
accumulated $SR_i$ and $SAR_i$ values to proportions by dividing each computed 
value by the sum of its consort, formally, 

$$SRP_i = \frac{SR_i}{\sum_k |SR_k|}, \qquad SARP_i = \frac{SAR_i}{\sum_k SAR_k} \tag{5}$$

with $k$ ranging over $\mathrm{hor}$, $\mathrm{ver}$, $\mathrm{div}$, $\mathrm{curl}$, $\mathrm{def}$. Here, $SRP_i$ represents the summed 
response proportion of SR parameter $i$ and $SARP_i$ represents the summed 
absolute response proportion of SAR parameter $i$. Notice that the min-max rescaling 
accomplished through (4) and the conversion to proportions via (5) accomplish 
different goals, both of which are necessary: the former brings all the kinematic 
variables into generally comparable units; the latter adapts the quantities to a 
given gesture sequence. In the end, we have a 10 component feature set, $SRP_i$ 
and $SARP_i$, $i \in \{\mathrm{hor}, \mathrm{ver}, \mathrm{div}, \mathrm{curl}, \mathrm{def}\}$, that encapsulates the kinematics of 
the imaged gesture. 
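
The conversion to proportions in (5) can be sketched as below. The dictionary layout and the small `eps` guard against an all-zero denominator are assumptions made for illustration.

```python
KEYS = ("hor", "ver", "div", "curl", "def")

def to_proportions(sr, sar, eps=1e-12):
    """Convert SR and SAR dictionaries into SRP and SARP as in (5).
    eps is a hypothetical guard for the degenerate all-zero case."""
    sr_total = max(sum(abs(sr[k]) for k in KEYS), eps)
    sar_total = max(sum(sar[k] for k in KEYS), eps)
    srp = {k: sr[k] / sr_total for k in KEYS}     # sign of SR is preserved
    sarp = {k: sar[k] / sar_total for k in KEYS}  # non-negative by construction
    return srp, sarp
```

A slow and a fast purely vertical movement map to the same `SRP` vector, which is the speed invariance the text asks for.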

Table 1. Gesture signatures. Each movement phoneme has a distinctive prototype 
signature defined in terms of our kinematic feature set. Kinematic features and 
movement phonemes are plotted along vertical and horizontal axes, resp. The SRP and 
SARP values are defined with respect to formula (5). 

2.4 Prototype Gesture Signatures 

Given our kinematic feature set, each of the primitive movements for ASL, shown 
in Fig. 1C has a distinctive idealized signature based on (separate) consideration 
of the SRPi and SARPi values (see Table 1). Analytical relationships between 
the 2D kinematic signatures and the 3D hand movements are presented in [8]. 

Distinctive signatures for the constant sign movements (i.e., movements a-i 
in Fig. 1C) are defined with reference to the $SRP_i$ values. Upward/downward 
movements result in responses to $\mathrm{ver}(t)$ alone; hence, of all the $SR_i$, only $SR_{\mathrm{ver}}$ 
should have a nonzero value in (5), leading to a signature of $|SRP_{\mathrm{ver}}| = 1$ while 
$|SRP_i| = 0$, $i \neq \mathrm{ver}$. In order to disambiguate between upward and downward 
movements, the sign of $SRP_{\mathrm{ver}}$ is taken into account: positive sign for downward 
and negative for upward. Similarly, rightward/leftward movements result in 
significant response to $\mathrm{hor}(t)$ alone, with the resulting signature of $|SRP_{\mathrm{hor}}| = 1$ 
while $|SRP_i| = 0$, $i \neq \mathrm{hor}$, and positive and negative signed $SRP_{\mathrm{hor}}$ 
corresponding to rightward and leftward movements, resp. The toward/away signer 
movements are manifest as significant responses in $\mathrm{div}(t)$ alone. Correspondingly, 
$|SRP_{\mathrm{div}}| = 1$ while other values are zero. For this case, positive sign on $SRP_{\mathrm{div}}$ 
is indicative of toward, while negative sign indicates away. The supinate/pronate 
gestures map to significant responses in $\mathrm{curl}(t)$ alone. Here, $|SRP_{\mathrm{curl}}| = 1$ while 
other values are zero, with positively and negatively signed $SRP_{\mathrm{curl}}$ indicating 
supinate and pronate, resp. Unlike the other movements described so far, nod 
has two significant kinematic quantities which have constant signed responses 
throughout the gesture, namely $\mathrm{def}(t)$ and $\mathrm{div}(t)$. The sign of $\mathrm{def}(t)$ should be 
positive, while the sign of $\mathrm{div}(t)$ should be negative, i.e., contraction. Further, 
the magnitudes of these two nonzero quantities should be equal. Therefore, we 
have $|SRP_{\mathrm{div}}| = |SRP_{\mathrm{def}}| = 0.5$ with all other responses zero. 

For periodic movements (i.e., movements j-n in Fig. 1C) distinctive 
signatures are defined with reference to the $SARP_i$ values. The definitions unfold 
analogously to those for the constant sign movements, albeit sign now plays no 
role as the $SARP_i$ are all non-negative by construction. An up and down movement 
maps directly to $\mathrm{ver}(t)$, resulting in a value of $SARP_{\mathrm{ver}}$ equal to 1 with other 
summed absolute response proportions zero. The side to side movement directly 
maps to $\mathrm{hor}(t)$, resulting in a value of $SARP_{\mathrm{hor}}$ equal to 1 while other values are 
zero. The to and fro movement maps directly to $\mathrm{div}(t)$, resulting in a value of 
$SARP_{\mathrm{div}}$ equal to 1 with other summed absolute response proportions zero. The 
twist wrist movement directly maps to $\mathrm{curl}(t)$, resulting in a value of $SARP_{\mathrm{curl}}$ 
equal to 1 with other values zero. The circular movement has two prominent 
kinematic quantities, $\mathrm{hor}(t)$ and $\mathrm{ver}(t)$. As the hand traces a circular 
trajectory, these two quantities will oscillate out of phase with each other (see Fig. 2). 
Across a complete gesture the two summed absolute responses are equal. The 
overall signature is thus $SARP_{\mathrm{hor}} = SARP_{\mathrm{ver}} = 0.5$, with all other values zero. 

For classification, we first calculate the Euclidean distance between our 
input signatures (i.e., $SRP_i$ and $SARP_i$) and their respective stored prototypical 
signatures. The result is a set of distances $d_j$ (14 in total). Taking the smallest 
distance as the classified gesture is not sufficient, since it presupposes that we 
know whether the classification is to be done with respect to the $SRP_i$ 
(constant sign cases) or the $SARP_i$ (periodic cases). This ambiguity can be resolved 
through re-weighting the distances by the reciprocal norm of their respective 
feature vectors, formally, 

$$\tilde{d}_j = \frac{1}{\|\mathbf{SR}\|}\, d_j, \quad j \in \{\text{constant sign distances}\}$$

$$\tilde{d}_j = \frac{1}{\|\mathbf{SAR}\|}\, d_j, \quad j \in \{\text{periodic distances}\}$$

with 

$$\mathbf{SR} = (SR_{\mathrm{hor}}, SR_{\mathrm{ver}}, SR_{\mathrm{div}}, SR_{\mathrm{curl}}, SR_{\mathrm{def}})$$

$$\mathbf{SAR} = (SAR_{\mathrm{hor}}, SAR_{\mathrm{ver}}, SAR_{\mathrm{div}}, SAR_{\mathrm{curl}}, SAR_{\mathrm{def}})$$

Intuitively, if the norm of $\mathbf{SR}$ is greater than that of $\mathbf{SAR}$, then the movement 
is more likely to be a constant sign movement; if the relative magnitudes are reversed then 
the movement is more likely to be periodic. Following the re-weighting, the 
movement with the smallest $\tilde{d}_j$ value is returned as the classification. Finally, 
for movements classified by distance as nod, we explicitly check to make sure 
$|SRP_{\mathrm{div}}| \approx |SRP_{\mathrm{def}}|$; if not, we take the next closest movement. Similarly, for 
circular we enforce that $SARP_{\mathrm{hor}} \approx SARP_{\mathrm{ver}}$. These explicit checks serve to 
reject misclassifications when noise happens to artificially push estimated feature 
value patterns toward the nod and circular signatures. 
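
The classification rule just described can be sketched as follows. The prototype values are the idealized signatures stated in the text. The tolerance `tol` used for the approximate-equality checks and the `eps` guard are hypothetical, since no thresholds are given here, and the function name is illustrative. The sketch uses `math.dist` and multi-argument `math.hypot`, so it assumes Python 3.8 or later.

```python
import math

# Idealized signatures ordered as (hor, ver, div, curl, def).
CONSTANT = {  # matched against SRP
    "upward": (0, -1, 0, 0, 0), "downward": (0, 1, 0, 0, 0),
    "rightward": (1, 0, 0, 0, 0), "leftward": (-1, 0, 0, 0, 0),
    "toward signer": (0, 0, 1, 0, 0), "away signer": (0, 0, -1, 0, 0),
    "nod": (0, 0, -0.5, 0, 0.5),
    "supinate": (0, 0, 0, 1, 0), "pronate": (0, 0, 0, -1, 0),
}
PERIODIC = {  # matched against SARP
    "up and down": (0, 1, 0, 0, 0), "side to side": (1, 0, 0, 0, 0),
    "to and fro": (0, 0, 1, 0, 0), "twist wrist": (0, 0, 0, 1, 0),
    "circular": (0.5, 0.5, 0, 0, 0),
}

def classify(sr, sar, tol=0.25, eps=1e-12):
    """Nearest-prototype classification with reciprocal-norm re-weighting.
    sr and sar are 5-tuples ordered as (hor, ver, div, curl, def);
    tol and eps are assumed values, not taken from the paper."""
    sr_norm = max(math.hypot(*sr), eps)
    sar_norm = max(math.hypot(*sar), eps)
    srp = [v / max(sum(abs(s) for s in sr), eps) for v in sr]
    sarp = [v / max(sum(sar), eps) for v in sar]
    scored = [(math.dist(srp, p) / sr_norm, name) for name, p in CONSTANT.items()]
    scored += [(math.dist(sarp, p) / sar_norm, name) for name, p in PERIODIC.items()]
    for _, name in sorted(scored):
        # Explicit checks: skip nod/circular unless their two components balance.
        if name == "nod" and abs(abs(srp[2]) - abs(srp[4])) > tol:
            continue
        if name == "circular" and abs(sarp[0] - sarp[1]) > tol:
            continue
        return name
```

For instance, `classify((0, 0, -7, 0, 3), (0, 0, 0, 0, 0))` is nearest to the nod prototype but fails the balance check, so the next closest movement, "away signer", is returned.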

3 Empirical Evaluation 

To assess the viability of our approach, we tested a software realization of our 
algorithm on a set of video sequences, each of which depicts a human volunteer 
executing a single movement phoneme. Here, our goal was to test the ability 
of our algorithm to correctly recognize movement, irrespective of the volunteer, 
hand location and shape of the complete gesture. Owing to the descriptive power 
of the phonemic decomposition of gestures into movement, location and shape 
primitives, consideration of all possible combinations would lead to an 
experiment that is not feasible.¹ Instead, we have chosen to subsample the hand shape 
and location dimensions by exploiting similarities in their respective configura- 
tions. For location we have selected whole head, torso and upper arm, see Fig. 
1A. These choices allow a range of locations to be considered and also introduce 
interesting constraints on how movements are executed. For instance, when the 
hand begins at the upper arm location, the natural tendency is to have the wrist 
rotated such that the hand is at a slight angle away from the body; as the hand 
moves towards the opposite side of the body, a slight rotation is introduced to 
bring the hand roughly parallel with the camera. For hand shape, we have 
selected A, B5, K and C (see Fig. 1B). The rationale for selecting hand shapes A, 
B5 and K is as follows: A (i.e. fist) and B5 (i.e. open flat hand) represent the two 
extremes of the hand shape space, whereas K (i.e. victory sign) represents an 
approximate midpoint of the space. Hand shape C has been included since it is 
a clear example of a hand shape being non-planar. This sampling leaves us with 
a total possible number of test cases equal to 14 (movements) × 3 (locations) × 
4 (shapes) = 168. However, several of these possibilities are difficult to realize 
(e.g., pronating movement at the upper arm location); dropping these leaves 
us with a total of 148 cases. Three volunteers each executed all 148 movements 
while their actions were recorded with a video camera to yield an experimental 
test set of 3 × 148 = 444. In addition, 12 volunteers each executed an 
approximately equal subset of the gesture space (approximately 14 gestures each). In total, our 
experimental test set consisted of 592 gestures. It should be noted that the 
volunteers were fully aware of the camera and their expected position with respect 
to it; this allowed precise control of the experimental variables for a systematic 
empirical test. With an eye toward applications, such control is not unrealistic: a 
natural signing conversation consists of directing one’s signing towards the other 
signer (in this case a camera). During acquisition, standard indoor, overhead 
fluorescent lighting was used and the normal (somewhat cluttered) background in 
our lab was present as volunteers signed in the foreground. Each gesture 
sequence was captured at a resolution of 640 × 480 pixels at 30 frames per second; 
for processing, the gesture sequences were subsampled temporally by a factor of 
two, resulting in a frame rate of 15 frames per second. Typically, the hand 
occupies a region of a frame approximately 100 pixels in 
both width and height. On average, the gesture sequences spanned 40 frames for 
constant sign movements and 80 for periodic movements. Prior to conducting 
the gesture, each volunteer was given a verbal description of it. This was done 
in order to ensure the capture of naturally occurring extraneous motions which 
can appear when an unbiased person performs the movements. See Fig. 2 for an 
example sequence. 



¹ Using Stokoe’s parameter definitions there would be 14 (movements) × 19 (shapes) 
× 12 (locations) = 3192 combinations for each volunteer. 

Fig. 2. Circular movement example. A circular movement image sequence with its 
accompanying kinematic time series plotted. The frame numbers marked on the graphs 
correspond to the frame numbers of the image sequence. 



3.1 Results 

To assess the joint performance of the tracker and classification stages, we 
conducted two trials. In the first trial, the hand region was manually 
outlined in the initial frame; in the second, the automated initial 
localization scheme outlined in this paper was used. In the manually initialized trial, 
97.13% of the 592 test cases were correctly identified; when considering the top 
two candidate movements, classification performance improved to 99.49%. For 
the automated localization trial, an accuracy rate of 86.00% was achieved, 
and 91.00% when considering the top two candidates. Further inspection of the 
results found that approximately 14% of the test cases in the automated 
localization trial failed to isolate a sufficient region of the hand (i.e., approximately 
50% of the hand). The majority of these cases consisted of the automated 
localization process homing in on the volunteer’s head since the head was the 
dominant moving structure. This is contrary to our assumption that the hand 
is the dominant moving structure in the scene. Treating these cases as failure 
to acquire and omitting them from further analysis resulted in an accuracy rate 
of 91.55% and an accuracy of 95.09% when considering the top two candidates 
(see Table 2). In terms of execution speed, the tracking speed using a Pentium 
4 2.1 GHz processor and unoptimized C code was 8 frames/second; the time 
consumed by all other components was negligible. 

Table 2. Gesture movement recognition results. The axes of the table represent the 
actual input gesture (vertical) versus the classification result (horizontal). Each cell (i,j) 
in the table holds the percentage of test cases that were actually i but classified as j for 
both manually initialized trials (left) and automatically initialized trials 
(right) (i.e., manual/automated). The diagonal (i,i) (highlighted in bold) represents the 
percentage of the correctly classified gestures. 



3.2 Discussion 

A current limitation is the automated initial localization process. The majority 
of the failed localization cases were attributed to gross head movements; the 
remaining localization problems occurred with users gesturing with bare arms 
(although most bare arm cases were localized properly) and users wearing 
skin-toned clothing. A review of the literature finds that most other related work has 
simplified the initial localization problem through manual segmentation [19,25, 
30], restricting the colours in the scene [17,24,26], restricting the type of cloth- 
ing worn (i.e. long sleeved shirts) [17,24,26], having users hold markers [5], using 
a priori knowledge of initial gesture pose [9,14], and using multiple, specially 
configured cameras [30] or magnetic trackers [6,10,18,30]. In our study, we make 
no assumptions along these lines; nevertheless, our results are competitive with 
those reported elsewhere. Beyond initialization, four failed tracking cases oc- 
curred related to frame-to-frame displacement beyond the capture range of our 
motion estimator. Drift has not been a significant factor in tracking during our 
experiments. This is due to the use of skin colour and change detection masks 
to define the region of support as well as a robust motion estimator to reject 
outliers. Possible solutions to tracking failure include the use of a higher frame 
rate camera to decrease interframe motion and/or the use of a motion estimator 
with a larger capture range (e.g., a correlation-based, rather than gradient-based, 
method). 

Given acceptable tracking, problems in the classification per se arose from 
non-intentional but significant movements accompanying the intended movement. 
For instance, when conducting the “away signer” movement, some of the 
subjects would rotate the palm of their hand about the camera axis as they 
were moving their hand forward. Systematic analysis of such cases may make it 
possible to improve our feature signatures to encompass such variations. 

It should be noted that to realize the above results we assumed that the 
gestures were temporally segmented. To relax this assumption, future work 
may appeal to detecting discontinuities in the kinematic feature time series to 
temporally segment the gestures (e.g. [23]). 

4 Summary 

We have presented a novel approach to vision-based gesture recognition, based 
on two key concepts. First, we appeal to linguistic theory to represent complex 
gestures in terms of their primitive components. By working with a finite set 
of primitives, which can be combined in a wide variety of ways, our approach 
has the potential to deal with a large vocabulary of gestures. Second, we define 
distinctive signatures for the primitive components that can be recovered from 
monocular image sequences. By working with signatures that can be recovered 
without special purpose equipment, our approach has the potential for use in a 
wide range of human computer interfaces. Using American Sign Language (ASL) 
as a test bed application, we have developed an algorithm for the recognition 
of the primitive contrastive movements (movement phonemes) from which ASL 
symbols are built. The algorithm recovers kinematic features from an input video 
sequence, based on an affine decomposition of the apparent motion(s) across the 
sequence. The recovered feature values yield movement signatures that are used 
in a nearest neighbour recognition system. Empirical evaluation of the algorithm 
suggests its applicability to the analysis of complex gesture videos. 

Abstract. We understand and reconstruct special surfaces from 3D 
data with line geometry methods. Based on estimated surface normals we 
use approximation techniques in line space to recognize and reconstruct 
rotational, helical, developable and other surfaces, which are character- 
ized by the configuration of locally intersecting surface normals. For the 
computational solution we use a modified version of the Klein model of 
line space. Obvious applications of these methods lie in Reverse Engi- 
neering. We have tested our algorithms on real world data obtained from 
objects such as antique pottery, gear wheels, and a surface of the ankle joint. 



Introduction. The geometric viewpoint turned out to be highly successful in 
dealing with a variety of problems in Computer Vision (see, e.g., [3,6,9,15]). 
So far mainly methods of analytic geometry (projective, affine and Euclidean) 
and differential geometry have been used. The present paper suggests employing 
line geometry as a tool which is both interesting and applicable to a number 
of problems in Computer Vision. Relations between vision and line geometry 
are not entirely new. Recent research on generalized cameras involves sets of 
projection rays which are more general than just bundles [1,7,18,22]. A beautiful 
exposition of the close connections of this research area with line geometry has 
recently been given by T. Pajdla [17]. 

The present paper deals with the problem of understanding and reconstruct- 
ing 3D shapes from 3D data. The data are assumed to be of a surface-like nature 
— either a cloud of measurement points, or another 3D shape representation 
such as a triangular mesh or a surface representation in parametric or implicit 
form — and we assume that we are able to obtain a discrete number of points on 
the surface and to estimate surface normals there. We are interested in classes of 
surfaces with special properties: planar, spherical and cylindrical surfaces, sur- 
faces of revolution and helical surfaces; and the more general surface classes of 
canal, pipe, and developable surfaces. For applications in CAD/CAM it is essential 
that such special shapes are not represented by freeform surfaces without 
regard to their special properties, but treated in a way more appropriate to their 
‘simple’ nature. 

Line geometry enters the problem of object reconstruction from point clouds 
via the surface normals estimated at the data points. In fact, modern 3D photography 
and the corresponding software deliver such normals together with 
the data points. It turns out that the surface classes mentioned above can be 
characterized in terms of their surface normals in an elegant, simple and computationally 
efficient way. Appropriate coordinates for lines (which yield a model 
of line space as a certain 4-dimensional manifold embedded in $\mathbb{R}^6$) allow us to 
classify point clouds and their normals (i.e., the so-called ‘normal congruence’) 
by means of tools such as principal component analysis. 



Previous work. There is a vast body of literature on surface reconstruction. 
Since we are interested in the reconstruction of special surfaces, we do not review 
the part of literature which deals with the reconstruction of triangular meshes 
or general freeform representations. In Computer Vision, recognition and recon- 
struction of special shapes is often performed by methods related to the Hough 
transform. Originally designed for the detection of straight lines in 2D, it re- 
ceived much attention and has been generalized so as to be able to detect and 
reconstruct many other shapes (see, e.g., [11,14]). Pure Hough transform meth- 
ods work in ‘spaces of shapes’ and quickly lead to high dimensions and reduced 
efficiency. In order to avoid these problems, such tools are sometimes augmented 
by methods from constructive geometry. This approach is already close to tech- 
niques invented by the CAD community, which use geometric characterizations 
of surfaces (e.g. by means of the Gaussian image) for data segmentation and 
the extraction of special shapes (see the survey [24]). Many papers deal with 
axis estimation of rotational surfaces, e.g. [25]. See [8] for an overview of the 
Hough transform, the RANSAC principle, and the least squares approach. In 
the present paper, however, rotational surfaces occur only as a special case. 

The use of line geometry for surface reconstruction was introduced in 
[20]. There, cylinders, surfaces of revolution and helical surfaces are recognized 
by the fact that their surface normals are contained in a so-called linear line 
complex. In particular, surfaces which can be moved within themselves in more 
than one way (right circular cylinders, spheres, planes) are detected. The technique 
can be extended to surfaces which are locally well approximated by the 
surface types mentioned above [2,13,21]. 



Contributions of the present paper. Inspired by the line geometric work on 
reverse engineering of special shapes, our paper presents a broader line geometric 
framework for the solution of problems in 3D shape understanding, segmentation 
and reconstruction: 

• We discuss a point model for line space, namely a certain 4-dimensional 
algebraic manifold $M^4$ of degree 4 in $\mathbb{R}^6$. It is better suited for line geometric 
approximation problems than the classical Klein model or the model used in 
[20], which is limited to linear line complexes. 

• This point model makes it possible to perform the basic shape recognition 
tasks via principal component analysis (PCA) of a point cloud (contained in 
$M^4$), which represents the estimated surface normals of the input shape. This 
procedure is further improved here and, unlike that of [20], is stable in all special cases. 

• The idea of looking for surface normals which intersect makes it possible to 
apply line-geometric methods to the problem of recognition and reconstruction 
of canal surfaces (which are the envelope of a one-parameter family of spheres), 
and of moulding surfaces. The latter denotes a certain class of sweep surfaces 
which contains the developable surfaces and the pipe surfaces. 

• The segmentation of composite surfaces into their ‘simple’ parts is addressed 
here in so far as recognition of surface type is essential for segmentation. Our 
algorithms may be included as a ‘black box’ into a segmentation algorithm. 

1 The 4-Dimensional Manifold of Lines in $\mathbb{R}^6$ 

This section discusses a computationally attractive point model of (i.e., coordinates 
in) the 4-dimensional manifold of straight lines in space. It is closely 
related to the classical Klein quadric (i.e., Plücker coordinates). We think of an 
oriented line $L$ as one equipped with a unit vector $l$ indicating the direction of 
the line, so that there are two oriented lines for each line of space. Then $L$ is 
determined by $l$ and the moment vector $\bar{l}$, which is computed by means of an 
arbitrary point $x$ on $L$ as $\bar{l} = x \times l$. The vector $\bar{l}$ is independent of the choice of $x$ on $L$. The 
six numbers $(l, \bar{l})$ are called normalized Plücker coordinates of $L$. ‘Normalized’ 
means that $\|l\| = 1$; further, they satisfy the orthogonality condition $l \cdot \bar{l} = 0$. 
Conversely, any two vectors $l, \bar{l} \in \mathbb{R}^3$ which satisfy these two conditions determine 
a unique oriented straight line $L$ in $\mathbb{R}^3$, which has $(l, \bar{l})$ as its normalized 
Plücker coordinates. 
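
As an illustration, the following minimal sketch computes normalized Plücker coordinates with NumPy and checks the properties stated above. The function name `plucker` and the sample point and direction are assumptions made for this example only; they do not come from the paper. The last step uses the standard identity $l \times (x \times l) = x - (x \cdot l)\,l$ to recover the point of the line closest to the origin, which shows how a pair $(l, \bar{l})$ determines the line.

```python
import numpy as np

def plucker(x, d):
    """Normalized Pluecker coordinates (l, l_bar) of the oriented line
    through the point x with direction d (hypothetical helper)."""
    l = np.asarray(d, dtype=float)
    l = l / np.linalg.norm(l)                        # enforce ||l|| = 1
    l_bar = np.cross(np.asarray(x, dtype=float), l)  # moment vector x × l
    return l, l_bar

x = np.array([1.0, 2.0, 3.0])   # sample point on the line
d = np.array([0.0, 3.0, 4.0])   # sample direction (not yet normalized)

l, l_bar = plucker(x, d)

# A different point on the same line yields the same moment vector.
l2, l_bar2 = plucker(x + 2.5 * l, d)

# Recover the point of the line closest to the origin from (l, l_bar).
foot = np.cross(l, l_bar)

print("l     =", l)
print("l_bar =", l_bar)
print("l . l_bar =", float(l @ l_bar))
print("foot point =", foot)
```

Reversing the direction `d` flips the sign of both `l` and `l_bar`, which corresponds to the two orientations of the same line.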

If we do not distinguish between the two opposite orientations of the same 
line, we may use all multiples of the pair $(l, \bar{l})$ as coordinates of a line. Of course, 
we still have the condition $l \cdot \bar{l} = 0$. Such homogeneous coordinate vectors of lines 
represent those points $(x, \bar{x})$ of five-dimensional projective space which are contained 
in the Klein quadric, given by the equation $x \cdot \bar{x} = 0$. This 
interpretation of lines is well studied in classical geometry, see [21]. 

The present paper pursues the following approach, which is closely related to 
the Klein quadric. We use only normalized coordinate vectors, and so we identify 
an oriented line with the point $(l, \bar{l})$ in six-dimensional Euclidean space $\mathbb{R}^6$. In 
this way we obtain a mapping $\alpha$ of oriented lines to points of a 4-dimensional 
manifold $M^4 \subset \mathbb{R}^6$. $M^4$ is algebraic of degree 4, and is the intersection of the 
cylinder $Z^5$ and the cone $\Gamma^5$ defined by 

$$Z^5 : x^2 = 1, \qquad \Gamma^5 : x \cdot \bar{x} = 0.$$ 

We use the Euclidean distance of points in $\mathbb{R}^6$ in order to measure distances 
between oriented lines $G, H$: If $G\alpha = (g, \bar{g})$ and $H\alpha = (h, \bar{h})$, then 

$$d(G, H)^2 = (g - h)^2 + (\bar{g} - \bar{h})^2. \tag{1}$$ 

The reasons why we prefer this distance function are the following: On the one 
hand, it is quadratic and thus lends itself to minimization. On the other hand, 
in a neighbourhood of the origin of the coordinate system, (1) models a distance 
of lines which is in accordance with visualization. The slight drawback that this 
is no longer the case in regions far away from the origin is in fact not important 
in most applications, as such applications very often define a natural region of 
interest, where we can put the origin of our coordinate system. 
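
The dependence of the distance (1) on the position of the origin can be made concrete with a small numerical sketch. The helper names `line_to_point` and `dist_sq`, the angle of 0.1 radians and the offset of 10 units are assumptions chosen for illustration. The sketch compares the same pair of intersecting lines, once meeting at the origin and once meeting at a point far away from it.

```python
import numpy as np

def line_to_point(x, d):
    """Image in R^6 of the oriented line through x with direction d:
    the normalized Pluecker coordinates stacked as one vector."""
    l = np.asarray(d, dtype=float)
    l = l / np.linalg.norm(l)
    return np.concatenate([l, np.cross(np.asarray(x, dtype=float), l)])

def dist_sq(G, H):
    """Squared Euclidean distance in R^6, as in equation (1)."""
    diff = G - H
    return float(diff @ diff)

theta = 0.1                                   # angle between the two lines
d1 = [0.0, 0.0, 1.0]
d2 = [0.0, np.sin(theta), np.cos(theta)]

# Both lines pass through the origin.
near = dist_sq(line_to_point([0, 0, 0], d1), line_to_point([0, 0, 0], d2))

# The same configuration, translated so the lines meet at (10, 0, 0).
far = dist_sq(line_to_point([10, 0, 0], d1), line_to_point([10, 0, 0], d2))

print("squared distance near the origin:", near)
print("squared distance far from origin:", far)
```

Although the two configurations are congruent, the second value is larger because the moment vectors scale with the distance from the origin. This is the drawback mentioned above, and it is why the origin is best placed inside the region of interest.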
Remark 1. Another method for introducing coordinates and measuring distances 
for lines, which has been used in the past, is to fix two parallel planes and 
describe a line by its two intersection points with these planes (cf. [21]). This 
leads to simpler formulae and a region of interest which is bounded by the two 
fixed planes (in our case, this region is a sphere centered at the origin). 



Classification of Surfaces by Normal Congruences 

The set of normals of a surface is called its normal congruence. Some surface 
types are easily recognized from properties of their normal congruence: The 
normals of a sphere pass through its center (they constitute a bundle with a finite 
vertex), and the normals of a plane are parallel (they constitute a parallel bundle, 
with vertex at infinity). These are the simplest examples; there are however 
other interesting and practically important classes of surfaces which are nicely 
characterized by their normal congruence. These classical results are basic to 
shape understanding and reconstruction algorithms and thus we summarize them 
in this section. 

A uniform motion in 3-space, composed of a uniform rotation of unit angular 
velocity about an axis and a translation of constant speed $p$ along this axis, is 
called a helical motion of pitch $p$. If we choose a Cartesian coordinate system 
with the $x_3$-axis being the axis of rotation, then the point $(x_1, x_2, x_3)$ will move 
according to 

$$x_1(t) = x_1 \cos t - x_2 \sin t, \quad x_2(t) = x_1 \sin t + x_2 \cos t, \quad x_3(t) = x_3 + pt. \tag{2}$$ 
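
The motion (2) is easy to evaluate numerically. The following sketch is an illustration only; the function name `helical_motion` and the sample values are assumptions, not part of the paper. It moves a point about the third coordinate axis and checks that the distance to the axis stays constant while the height grows linearly with pitch $p$.

```python
import numpy as np

def helical_motion(point, t, p):
    """Position at time t of `point` under the helical motion of pitch p
    about the third coordinate axis, following equation (2)."""
    x1, x2, x3 = point
    return np.array([
        x1 * np.cos(t) - x2 * np.sin(t),
        x1 * np.sin(t) + x2 * np.cos(t),
        x3 + p * t,
    ])

start = (1.0, 0.0, 0.0)

# Pitch 0: pure rotation, a quarter turn maps (1, 0, 0) to (0, 1, 0).
quarter_turn = helical_motion(start, np.pi / 2, p=0.0)

# Pitch 0.5: after one full turn the point returns above its start,
# lifted by 2 * pi * p along the axis.
full_turn = helical_motion(start, 2 * np.pi, p=0.5)

print("quarter turn, p = 0:", quarter_turn)
print("full turn, p = 0.5: ", full_turn)
```

Sampling `t` over an interval for every point of a profile curve would produce points on the helical surface swept by that curve.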

In the case $p = 0$ we have a uniform rotation. For $p \to \infty$ we get, in the 
limit, a uniform translation. A surface swept by a curve under a helical motion 
is called a helical surface. As special and limit cases for $p = 0$ and $p = \infty$ we 
get surfaces of revolution and cylindrical surfaces, respectively. We say that these 
surfaces are kinematically generated. However, if we speak of a helical surface we 
always mean part of a complete helical surface as defined above, and analogously 
for other adjectives, like rotational/cylindrical/spherical/planar. Closely related 
to helical motions are linear complexes, which are certain three-parameter sets 
of lines defined by linear equations, and which are discussed in more detail in 
[21]. A line $L$ with Plücker coordinates $(l, \bar{l})$ is contained in the complex $C$ with 
coordinates $(c, \bar{c})$ if and only if 

$$L \in C \iff \bar{c} \cdot l + c \cdot \bar{l} = 0. \tag{3}$$ 

Obviously the lines of the complex $C$ defined by (3) correspond to those points 
of $\mathbb{R}^6$ which are both contained in $M^4$ and fulfill (3), i.e., they lie in $M^4$ and in 
the hyperplane $H^5 : \bar{c} \cdot x + c \cdot \bar{x} = 0$. Note that $H^5$ passes through the origin. The 
set $C\alpha$ is a certain 3-dimensional manifold. A linear complex $C$ is called singular 
if $c \cdot \bar{c} = 0$ (then there is a line $A$ with $A\alpha = (c, \bar{c})$, and $C$ consists of all lines 
which intersect $A$). 
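
The singular case can be checked numerically. The sketch below is illustrative; the helper names `plucker` and `complex_residual` and the sample lines are assumptions made for this example. It takes the coordinates of a line $A$ as the coordinates of a complex and evaluates the left-hand side of (3) for one line that meets $A$ and one that does not.

```python
import numpy as np

def plucker(x, d):
    """Normalized Pluecker coordinates (l, l_bar) of the line through x
    with direction d (hypothetical helper)."""
    l = np.asarray(d, dtype=float)
    l = l / np.linalg.norm(l)
    return l, np.cross(np.asarray(x, dtype=float), l)

def complex_residual(c, c_bar, l, l_bar):
    """Left-hand side of equation (3); zero means the line is in the complex."""
    return float(np.dot(c_bar, l) + np.dot(c, l_bar))

# Axis A: the line through (0, 0, 1) parallel to the first coordinate axis.
c, c_bar = plucker([0, 0, 1], [1, 0, 0])

# A line through the point (2, 0, 1), which lies on A, so it meets A.
meeting = plucker([2, 0, 1], [0, 1, 1])

# The second coordinate axis: skew to A (at distance 1), so it misses A.
skew = plucker([0, 0, 0], [0, 1, 0])

r_meeting = complex_residual(c, c_bar, *meeting)
r_skew = complex_residual(c, c_bar, *skew)

print("c . c_bar          =", float(c @ c_bar))   # singular complex
print("residual (meeting) =", r_meeting)
print("residual (skew)    =", r_skew)
```

In floating-point arithmetic the membership test should compare the residual against a tolerance rather than against exact zero.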



